Convergence Analysis of Inexact Randomized Iterative Methods

In this paper we present a convergence rate analysis of inexact variants of several randomized iterative methods. Among the methods studied are: stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic subspace ascent. A common feature of these methods is that in their update rule a certain sub-problem needs to be solved exactly. We relax this requirement by allowing for the sub-problem to be solved inexactly. In particular, we propose and analyze inexact randomized iterative methods for solving three closely related problems: a convex stochastic quadratic optimization problem, a best approximation problem and its dual, a concave quadratic maximization problem. We provide iteration complexity results under several assumptions on the inexactness error. Inexact variants of many popular and some more exotic methods, including randomized block Kaczmarz, randomized Gaussian Kaczmarz and randomized block coordinate descent, can be cast as special cases. Numerical experiments demonstrate the benefits of allowing inexactness.


Introduction
In the era of big data, where data sets continuously grow larger, randomized iterative methods have become very popular and now play a major role in areas such as numerical linear algebra, scientific computing and optimization. They are preferred mainly because of their cheap per-iteration cost, which improves upon the complexity of classical results by orders of magnitude, and because they can easily scale to extreme dimensions. However, a common feature of these methods is that in their update rule a particular subproblem needs to be solved exactly. When the size of this subproblem is large, this step can be computationally very expensive. The purpose of this work is to reduce the cost of this step by incorporating inexact updates in the stochastic methods under study.

The Setting
In this paper we are interested in solving three closely related problems:

• Stochastic Quadratic Optimization Problem

• Best Approximation Problem

• Concave Quadratic Maximization Problem
We start by presenting the main connections and key relationships between these problems as well as popular randomized iterative methods (with exact updates) for solving each one of them.
Stochastic Optimization Problem: We study the stochastic quadratic optimization problem first proposed in [46] for reformulating consistent linear systems:

min_{x∈R^n} f(x) := E_{S∼D}[f_S(x)],   (1)

Ax = b.   (2)

In particular, problem (1) is defined by setting

f_S(x) := 1/2 ‖Ax − b‖²_H = 1/2 (Ax − b)^⊤ H (Ax − b),

where H is a random symmetric positive semi-definite matrix H := S(S^⊤ A B^{−1} A^⊤ S)^† S^⊤ that depends on three different matrices: the data matrix A ∈ R^{m×n} of the linear system (2), a random matrix S ∈ R^{m×q} ∼ D, and an n × n positive definite matrix B which defines the geometry of the space. Throughout the paper, B is used to define a B-inner product in R^n via ⟨x, z⟩_B := ⟨Bx, z⟩ and an induced B-norm ‖x‖_B := (x^⊤ B x)^{1/2}. By † we denote the Moore-Penrose pseudoinverse. The expectation in (1) is over random matrices S with m rows (and an arbitrary number of columns q, e.g., q = 1) drawn from an arbitrary (user-defined) distribution D. The authors of [46] give necessary and sufficient conditions that the distribution D needs to satisfy for the set of solutions of (1) to be equal to the set of solutions of the linear system (2); a property for which the term exactness was coined (see Section 3 for more details on exactness).
In [46], problem (1) was solved via Stochastic Gradient Descent (SGD)¹:

x_{k+1} = x_k − ω ∇f_{S_k}(x_k),   (4)

and a linear rate of convergence was proved despite the fact that f is not necessarily strongly convex, (1) is not a finite-sum problem, and a fixed stepsize ω > 0 is used. The stochastic optimization problem (1) has many unique characteristics, mainly because it has been constructed in a particular way in order to capture all the information of the linear system (2). For example, it holds that f_S(x) = 1/2 ‖∇f_S(x)‖²_B, and it can be proved that all eigenvalues of its Hessian matrix ∇²f(x) are upper bounded by 1. Due to these specific characteristics, the update rules of seemingly different randomized iterative methods are identical. In particular, the following methods for solving (1) have exactly the same behavior as SGD [46]:

Stochastic Newton Method (SNM)²: x_{k+1} = x_k − ω (∇²f_{S_k}(x_k))^{†_B} ∇f_{S_k}(x_k),   (5)

Stochastic Proximal Point Method (SPPM)³: x_{k+1} = arg min_{x∈R^n} { f_{S_k}(x) + (1−ω)/(2ω) ‖x − x_k‖²_B }.   (6)

In all methods, ω > 0 is a fixed stepsize and S_k is sampled afresh in each iteration from the distribution D. See [46] for more insights into the reformulation (1), its properties and other equivalent reformulations (e.g., stochastic fixed point problem, probabilistic intersection problem, and stochastic linear system).
Best Approximation Problem and Sketch and Project Method: In [46,29], it has been shown that for the case of consistent linear systems with multiple solutions, SGD (and as a result SNM (5) and SPPM (6)) converges linearly to one particular minimizer of the function f: the projection of the initial iterate x_0 onto the solution set of the linear system (2). This naturally leads to the best approximation problem:

min_{x∈R^n} P(x) := 1/2 ‖x − x_0‖²_B subject to Ax = b.   (7)
¹ The gradient is computed with respect to the inner product ⟨Bx, y⟩ = ⟨x, y⟩_B. ² In this method we take the B-pseudoinverse of the Hessian of f_{S_k} instead of the classical inverse, as the inverse does not exist. When B = I, the B-pseudoinverse specializes to the standard Moore-Penrose pseudoinverse. ³ In this case, the equivalence only works for 0 < ω ≤ 1.
Unlike the linear system (2), which is allowed to have multiple solutions, the best approximation problem always has (by construction) a unique solution. For solving problem (7), the Sketch and Project Method (SPM):

x_{k+1} = ω Π_{L_{S_k},B}(x_k) + (1 − ω) x_k,   (8)

was analyzed in [18,46]. Here, Π_{L_{S_k},B}(x_k) denotes the projection of the point x_k onto L_{S_k} = {x ∈ R^n : S_k^⊤ A x = S_k^⊤ b} in the B-norm. In the special case of unit stepsize (ω = 1), algorithm (8) simplifies to

x_{k+1} = Π_{L_{S_k},B}(x_k),   (9)

first proposed in [18]. The name Sketch and Project method is justified by the two-step structure of the iteration: (i) Choose the sketched system L_{S_k} := {x : S_k^⊤ A x = S_k^⊤ b}, and (ii) Project the last iterate x_k onto L_{S_k}. The sketch and project viewpoint will be useful later in explaining the natural interpretation of the proposed inexact update rules (see Section 4.2).
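To make the projection step concrete, here is a minimal numpy sketch of one sketch-and-project iteration, assuming B = I; the matrices A, b, S and all dimensions below are hypothetical illustration data, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small consistent system Ax = b (illustration only).
m, n = 10, 5
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)

def spm_step(x, S, A, b, omega=1.0):
    """One sketch-and-project update with B = I: relax toward the
    projection of x onto the sketched set {x : S^T A x = S^T b}."""
    M = S.T @ A @ A.T @ S                          # small q x q matrix
    lam = np.linalg.pinv(M) @ (S.T @ (b - A @ x))  # least-norm solution
    return x + omega * (A.T @ S @ lam)

x = rng.standard_normal(n)
S = rng.standard_normal((m, 3))                    # Gaussian sketch, q = 3
x_next = spm_step(x, S, A, b, omega=1.0)

# With omega = 1 the new iterate satisfies the sketched equations exactly.
print(np.allclose(S.T @ A @ x_next, S.T @ b))      # True
```

The pseudoinverse call is the "expensive" inner solve that the paper later proposes to replace by an inexact solver.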
Dual Problem and SDSA: The Fenchel dual of (7) is the (bounded) unconstrained concave quadratic maximization problem

max_{y∈R^m} D(y) := (b − A x_0)^⊤ y − 1/2 ‖A^⊤ y‖²_{B^{−1}}.   (10)

Boundedness follows from consistency. It turns out that by varying A, B and b (while keeping the linear system consistent), the dual problem in fact captures all bounded unconstrained concave quadratic maximization problems [29]. A direct dual method for solving problem (10) was first proposed in [19]. The dual method, Stochastic Dual Subspace Ascent (SDSA), updates the dual vectors y_k as follows:

y_{k+1} = y_k + ω S_k λ_k,   (11)

where the random matrix S_k is sampled afresh in each iteration from the distribution D, and λ_k is chosen in such a way as to maximize the dual objective D: λ_k ∈ arg max_λ D(y_k + S_k λ). More specifically, SDSA is defined by picking the λ_k with the smallest (standard Euclidean) norm. This leads to the formula:

λ_k = (S_k^⊤ A B^{−1} A^⊤ S_k)^† S_k^⊤ (b − A(x_0 + B^{−1} A^⊤ y_k)).

It can be proved [19,29] that the iterates {x_k}_{k≥0} of the sketch and project method (8) arise as affine images of the iterates {y_k}_{k≥0} of the dual method (11) as follows:

x_k = x(y_k) = x_0 + B^{−1} A^⊤ y_k.   (13)

In [19] the dual method was analyzed for the case of unit stepsize (ω = 1). Later, in [29], the analysis was extended to capture the cases ω ∈ (0, 2). Momentum variants of the dual method that provide further speed-up have also been studied in [29].
An interesting property relating the Sketch and Project method and SDSA is that the dual suboptimality of y in terms of dual function values is equal to the primal suboptimality of x(y) in terms of distance [19,29]. That is,

D(y*) − D(y) = 1/2 ‖x(y) − x*‖²_B.

This simple-to-derive result (obtained by combining the expression for the dual function D(y) (10) with equation (13)) gives, for free, the convergence analysis of SDSA in terms of dual function suboptimality once the analysis of Sketch and Project is available (see Section 5).
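This primal-dual identity can be checked numerically. A small numpy experiment, assuming B = I and hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical consistent system with B = I (illustration only).
m, n = 8, 5
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)
x0 = rng.standard_normal(n)

def D(y):
    """Dual objective D(y) = (b - A x0)^T y - 1/2 ||A^T y||^2 for B = I."""
    return (b - A @ x0) @ y - 0.5 * np.linalg.norm(A.T @ y) ** 2

pinvAAt = np.linalg.pinv(A @ A.T)
y_star = pinvAAt @ (b - A @ x0)         # maximizer of D (least-norm)
x_star = x0 + A.T @ y_star              # projection of x0 onto {x : Ax = b}

# Dual suboptimality of y equals primal suboptimality of x(y) = x0 + A^T y.
y = rng.standard_normal(m)
lhs = D(y_star) - D(y)
rhs = 0.5 * np.linalg.norm((x0 + A.T @ y) - x_star) ** 2
print(np.isclose(lhs, rhs))             # True
```

The same check goes through for any positive definite B by replacing the Euclidean norms with B-weighted ones.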

Contributions
In this work we propose and analyze inexact variants of all the previously mentioned randomized iterative algorithms for solving the stochastic optimization problem, the best approximation problem and the dual problem. In all of these methods, a certain potentially expensive calculation/operation needs to be performed in each step; it is this operation that we propose to perform inexactly. For instance, in the case of SGD it is the computation of the stochastic gradient ∇f_{S_k}(x_k), in the case of SPM it is the computation of the projection Π_{L_{S_k},B}(x_k), and in the case of SDSA it is the computation of the dual update S_k λ_k. We perform an iteration complexity analysis under an abstract notion of inexactness, and also under a more structured form of inexactness appearing in practical scenarios. An inexact solution of these subproblems can be obtained much more quickly than the exact solution. Since in practical applications the savings thus obtained are larger than the increase in the number of iterations needed for convergence, our inexact methods can be dramatically faster.
Let us now briefly outline the rest of the paper. In Section 2 we describe the subproblems and introduce two notions of inexactness (abstract and structured) that will be used in the rest of the paper. The Inexact Basic Method (iBasic) is also presented. iBasic is a method that simultaneously captures inexact variants of the algorithms (4), (5), (6) for solving the stochastic optimization problem (1) and of algorithm (8) for solving the best approximation problem (7). It is an inexact variant of the Basic Method, first presented in [46], where the inexactness is introduced by the addition of an inexactness error ε_k to the original update rule. We illustrate the generality of iBasic by presenting popular algorithms that can be cast as special cases.
In Section 3 we establish convergence results for iBasic under general assumptions on the inexactness error ε_k of its update rule (see Algorithm 1). In this part we do not focus on any specific mechanism leading to inexactness; we treat the problem abstractly. However, such errors appear often in practical scenarios and can be associated with inaccurate numerical solvers, quantization, sparsification and compression mechanisms. In particular, we introduce several abstract assumptions on the inexactness level and describe our generic convergence results. For all assumptions we establish a linear rate of decay of the quantity E[‖x_k − x*‖²_B] (i.e., L2 convergence)⁴. Subsequently, in Section 4 we apply our general convergence results to a more structured notion of inexactness error and propose concrete mechanisms leading to such errors. We provide theoretical guarantees for situations in which a linearly convergent iterative method (e.g., Conjugate Gradient) is used to solve the subproblem inexactly. We also highlight the importance of the dual viewpoint through a sketch-and-project interpretation.
In Section 5 we study an inexact variant of SDSA, which we call iSDSA, for directly solving the dual problem (10). We provide a correspondence between iBasic and iSDSA, and we show that the random iterates of iBasic arise as affine images of the iterates of iSDSA. We consider both abstract and structured inexactness errors and provide linear convergence rates in terms of dual function suboptimality. Finally, in Section 6 we evaluate the performance of the proposed inexact methods through numerical experiments and show the benefits of our approach on both synthetic and real datasets. Concluding remarks are given in Section 7.
A summary of the convergence results for iBasic under several assumptions on the inexactness error, with pointers to the relevant theorems, is available in Table 1. We highlight that similar convergence results can also be obtained for iSDSA in terms of the dual function suboptimality E[D(y*) − D(y_k)] (see Section 5 for more details on iSDSA).

Notation
For convenience, a table of the most frequently used notation is included in Appendix C. In particular, with boldface upper-case letters we denote matrices, and I is the identity matrix. By L we denote the solution set of the linear system Ax = b. By L_S, where S is a random matrix, we denote the solution set of the sketched linear system S^⊤ A x = S^⊤ b. In general, we use ·* to express the exact solution of a sub-problem and ·≈ to indicate its inexact variant. Unless stated otherwise, throughout the paper x* is the projection of x_0 onto L in the B-norm: x* = Π_{L,B}(x_0). An explicit formula for the projection of a point x onto the set L is given by

Π_{L,B}(x) = x − B^{−1} A^⊤ (A B^{−1} A^⊤)^† (Ax − b).

A formula for the projection onto L_S = {x ∈ R^n : S^⊤ A x = S^⊤ b} is obtained by replacing A and b with S^⊤ A and S^⊤ b, respectively, in the above equation. We denote this projection by Π_{L_S,B}(x). We also write [n] := {1, 2, . . . , n}.

Table 1: Summary of the iteration complexity results obtained in this paper (columns: Assumption on the inexactness error ε_k; stepsize ω; upper bounds; relevant theorem). ω denotes the stepsize (relaxation parameter) of the method. In all cases, x* = Π_{L,B}(x_0) and ρ = 1 − ω(2 − ω)λ_min^+ ∈ (0, 1) are the quantities appearing in the convergence results (here λ_min^+ denotes the minimum nonzero eigenvalue of the matrix W; see equation (19)). The inexactness parameter q is always chosen in such a way as to obtain linear convergence, and it can be seen as the quantity that controls the inexactness. In all theorems the quantity of convergence is E[‖x_k − x*‖_B].
In order to keep expressions brief, throughout the paper we define⁵

Z := A^⊤ S (S^⊤ A B^{−1} A^⊤ S)^† S^⊤ A = A^⊤ H A.

Using this matrix we can easily express important quantities related to the problems under study. For example, the stochastic functions f_S of problem (1) can be expressed as

f_S(x) = 1/2 (x − x*)^⊤ Z (x − x*) = 1/2 ‖x − x*‖²_Z.

In addition, the gradient and the Hessian of f_S with respect to the B-inner product are equal to ∇f_S(x) = B^{−1} Z (x − x*) and ∇²f_S(x) = B^{−1} Z, respectively [46]. Similarly, the gradient and Hessian of the objective function f of (1) are ∇f(x) = B^{−1} E[Z](x − x*) and ∇²f(x) = B^{−1} E[Z]. A key matrix in our analysis is

W := B^{−1/2} E[Z] B^{−1/2},   (19)

which has the same spectrum as the matrix ∇²f(x) but at the same time is symmetric and positive semi-definite⁶. We denote by λ_1 ≤ λ_2 ≤ · · · ≤ λ_n the n eigenvalues of W. With λ_min^+ we indicate the smallest nonzero eigenvalue, and with λ_max = λ_n the largest eigenvalue. It was shown in [46] that 0 ≤ λ_i ≤ 1 for all i ∈ [n].
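As a sanity check on these quantities, the following numpy experiment (B = I, hypothetical data) verifies the identity f_S(x) = 1/2 ‖∇f_S(x)‖²_B and the fact that the eigenvalues of W lie in [0, 1]; the check exploits that, for B = I, each realization of Z is a projector:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with B = I, so that Z = A^T S (S^T A A^T S)^† S^T A.
m, n = 8, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true

def Z_of(S):
    M = S.T @ A @ A.T @ S
    return A.T @ S @ np.linalg.pinv(M) @ S.T @ A

# f_S(x) = 1/2 (x - x*)^T Z (x - x*) and grad f_S(x) = Z (x - x*) for B = I;
# since Z^2 = Z, the identity f_S(x) = 1/2 ||grad f_S(x)||^2 follows.
S = rng.standard_normal((m, 3))
Z = Z_of(S)
x = rng.standard_normal(n)
f_val = 0.5 * (x - x_true) @ Z @ (x - x_true)
grad = Z @ (x - x_true)
print(np.isclose(f_val, 0.5 * np.linalg.norm(grad) ** 2))   # True

# For a finite uniform family of sketches, W = E[Z] has eigenvalues in [0, 1].
W = np.mean([Z_of(rng.standard_normal((m, 3))) for _ in range(50)], axis=0)
eigs = np.linalg.eigvalsh(W)
print(eigs.min() >= -1e-10 and eigs.max() <= 1 + 1e-10)     # True
```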

Inexact update rules
In this section we start by explaining the key sub-problems that need to be solved exactly in the update rules of the previously described methods. We present iBasic, a method that solves problems (1) and (7), and we show how, by varying the main parameters of the method, we recover inexact variants of popular algorithms as special cases. Finally, closely related work on inexact algorithms for solving different problems is also presented.

Expensive Sub-problems in Update Rules
Let us devote this subsection to explaining how inexactness can be introduced into the current exact update rules of SGD⁷ (4), Sketch and Project (8) and SDSA (11) for solving the stochastic optimization, best approximation and dual problems, respectively. As we have shown, these methods solve closely related problems, and the key subproblems in their update rules are similar. However, the introduction of inexactness in the update rule of each of them can have a different interpretation.
For example, in the case of SGD for solving the stochastic optimization problem (1) (see also Sections 4.1 and 4.2 for more details), if we define λ*_k := (S_k^⊤ A B^{−1} A^⊤ S_k)^† S_k^⊤ (b − A x_k), the update rule of SGD takes the form:

x_{k+1} = x_k + ω B^{−1} A^⊤ S_k λ*_k.

Clearly, in this update the expensive part is the computation of the quantity λ*_k, which can equivalently be computed as the least-norm solution of the smaller (in comparison to Ax = b) linear system

S_k^⊤ A B^{−1} A^⊤ S_k λ = S_k^⊤ (b − A x_k).

In our work we suggest using an approximation λ≈_k of the exact solution, and in this way avoiding the possibly expensive step of the update rule. Thus the inexact update takes the following form:

x_{k+1} = x_k + ω B^{−1} A^⊤ S_k λ*_k + ε_k.

Here ε_k denotes a more abstract notion of inexactness; it need not always be equal to the quantity ωB^{−1}A^⊤S_k(λ≈_k − λ*_k). It can be interpreted as an expression that acts as a perturbation of the exact update. In the case that ε_k has the above form, we say that the notion of inexactness is structured. In our work we are interested in both the abstract and the more structured notions of inexactness. We first present general convergence results where we only require the error ε_k to satisfy general assumptions (without caring how this error is generated), and later we analyze the concept of structured inexactness by presenting algorithms where ε_k = ωB^{−1}A^⊤S_k(λ≈_k − λ*_k). In a similar way, the expensive operation of SPM (8) is the exact computation of the projection Π*_{L_{S_k},B}(x_k). Thus we suggest replacing this step with an inexact variant that computes an approximation of this projection. The inexactness here can also be interpreted using both the abstract error ε_k and its more structured version. At this point, observe that, by using the expression (15), the structure of the ε_k in SPM and SGD has the same form.
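To illustrate the structured inexactness, the sketch below (numpy, B = I, hypothetical data) computes the exact λ*_k as the least-norm solution of the small system and an inexact λ≈_k via a few conjugate gradient steps, which is one natural source of the error ε_k = ωB^{−1}A^⊤S_k(λ≈_k − λ*_k):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data, B = I (illustration only).
m, n, q = 30, 20, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)
x_k = rng.standard_normal(n)
S = rng.standard_normal((m, q))

# The small q x q system  M lam = d  with  M = S^T A A^T S,  d = S^T (b - A x_k).
M = S.T @ A @ A.T @ S
d = S.T @ (b - A @ x_k)
lam_exact = np.linalg.pinv(M) @ d              # lam*_k (least-norm solution)

def cg(M, d, iters):
    """A few conjugate gradient steps from zero: a cheap inexact solve."""
    lam = np.zeros_like(d)
    r = d.copy()
    p = r.copy()
    for _ in range(iters):
        Mp = M @ p
        alpha = (r @ r) / (p @ Mp)
        lam = lam + alpha * p
        r_new = r - alpha * Mp
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return lam

# The error to lam*_k shrinks as we spend more inner iterations.
errs = [np.linalg.norm(cg(M, d, t) - lam_exact) for t in (1, 3, 10)]
print(errs[0] > errs[-1])  # True
```

Spending fewer inner iterations gives a cheaper but noisier step; this trade-off is exactly what the convergence analysis of the paper quantifies.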
In SDSA, the expensive subproblem in the update rule is the computation of λ*_k satisfying λ*_k ∈ arg max_λ D(y_k + S_k λ). Using the definition of the dual function (10), this value can also be computed as the least-norm solution of the linear system S_k^⊤ A B^{−1} A^⊤ S_k λ = S_k^⊤ (b − A(x_0 + B^{−1} A^⊤ y_k)). Later, in Section 5, we analyze both notions of inexactness (abstract and more structured) for inexact variants of SDSA. Table 2 presents the key sub-problem that needs to be solved in each algorithm, as well as the part of the update rule in which the inexactness error appears.

Table 2: The exact algorithms under study with the potentially expensive-to-compute key sub-problems of their update rules. Columns: Exact Algorithm; Key Subproblem (the problem that we solve inexactly); Inexact Update Rules (abstract and structured inexactness error). For example, for SGD (4) the key subproblem is the exact computation of λ*_k = arg min_{λ : M_k λ = d_k} ‖λ‖, which appears in the computation of ∇f_{S_k}(x_k). We use ·* to denote the important quantity that needs to be computed exactly in the update rule of each method and ·≈ to indicate the proposed inexact variant.

The Inexact Basic Method
In each iteration of all the aforementioned exact methods, a sketch matrix S ∼ D is drawn from a given distribution, and then a certain subproblem is solved exactly to obtain the next iterate. The sketch matrix S ∈ R^{m×q} is required to have m rows, but no assumption on the number of columns is made, which means that the number of columns q is allowed to vary through the iterations and can be very large. The setting that we are interested in is precisely that of having such large random matrices S. In these cases we expect that having approximate solutions of the subproblems will be beneficial.
Recently, randomized iterative algorithms that require solving large subproblems in each iteration have been extensively studied, and it was shown that they can be really beneficial when compared to their single-coordinate variants (S ∈ R^{m×1}) [34,35,44,27]. However, in these cases the evaluation of an exact solution of the subproblem in the update rule can be computationally very expensive. In this work we propose and analyze inexact variants, obtained by allowing the subproblems that appear in the update rules of the stochastic methods to be solved inexactly. In particular, following the convention established in [46] of naming the main algorithm of the paper the Basic method, we propose the inexact Basic method (iBasic) (Algorithm 1).

Algorithm 1 Inexact Basic Method (iBasic)
Input: Distribution D from which we draw random matrices S, positive definite matrix B ∈ R^{n×n}, stepsize ω > 0.
Initialize: x_0 ∈ R^n
1: for k = 0, 1, 2, . . . do
2: Generate a fresh sample S_k ∼ D
3: Set x_{k+1} = x_k − ω B^{−1} A^⊤ S_k (S_k^⊤ A B^{−1} A^⊤ S_k)^† S_k^⊤ (A x_k − b) + ε_k   (17)
4: end for

The ε_k in the update rule of the method represents the abstract inexactness error described in Subsection 2.1. Note that iBasic can have several equivalent interpretations. This allows us to study the methods (4), (5), (6) for solving the stochastic optimization problem and the sketch and project method (8) for the best approximation problem in a single algorithm. In particular, iBasic can be seen as inexact stochastic gradient descent (iSGD) with fixed stepsize applied to (1). From (17), ∇f_{S_k}(x_k) = B^{−1} A^⊤ S_k (S_k^⊤ A B^{−1} A^⊤ S_k)^† S_k^⊤ (A x_k − b), and as a result the update rule of iBasic can be equivalently written as:

x_{k+1} = x_k − ω ∇f_{S_k}(x_k) + ε_k.

In the case of the best approximation problem (7), iBasic can be interpreted as an inexact Sketch and Project method (iSPM) as follows:

x_{k+1} = ω Π_{L_{S_k},B}(x_k) + (1 − ω) x_k + ε_k.

For the dual problem (10), we devote Section 5 to presenting an inexact variant of SDSA (iSDSA) and analyzing its convergence using the rates obtained for iBasic in Sections 3 and 4.
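A minimal numpy implementation sketch of iBasic, assuming B = I and Gaussian sketches; all data below is hypothetical illustration, the callable `eps` models the abstract inexactness error ε_k, and `eps=None` recovers the exact Basic method:

```python
import numpy as np

def ibasic(A, b, x0, omega=1.0, iters=1500, q=4, eps=None, seed=0):
    """iBasic with B = I and Gaussian sketches:
    x_{k+1} = x_k + w A^T S (S^T A A^T S)^† S^T (b - A x_k) + eps_k."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for k in range(iters):
        S = rng.standard_normal((A.shape[0], q))   # fresh sample S_k ~ D
        lam = np.linalg.pinv(S.T @ A @ A.T @ S) @ (S.T @ (b - A @ x))
        x = x + omega * (A.T @ S @ lam)
        if eps is not None:
            x = x + eps(k, x)                      # abstract inexactness error
    return x

# Hypothetical consistent system; eps = None gives the exact Basic method.
rng = np.random.default_rng(4)
m, n = 30, 10
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true
x_fin = ibasic(A, b, np.zeros(n))
print(np.linalg.norm(x_fin - x_true) < 1e-4)
```

Since A has full column rank here, the system has a unique solution and the method converges to it; with multiple solutions it would converge to the projection of x_0 onto the solution set.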

General Framework and Further Special Cases
The proposed inexact methods, iBasic (Algorithm 1) and iSDSA (Section 5), belong to the general sketch and project framework, first proposed by Gower and Richtárik in [18] for solving consistent linear systems, in which a unified analysis of several randomized methods was given. This interpretation of the algorithms allows us to recover a comprehensive array of well-known methods as special cases by carefully choosing the combination of the main parameters of the algorithms.
In particular, iBasic has two main parameters (besides the stepsize ω > 0 of the update rule): the distribution D from which we draw the random matrices S, and the positive definite matrix B ∈ R^{n×n}. By choosing combinations of the parameters D and B carefully, we can recover several existing popular algorithms as special cases of the general method. For example, special cases of the exact Basic method are the Randomized Kaczmarz method, Randomized Gaussian Kaczmarz⁸, Randomized Coordinate Descent and their block variants. For more details on the generality of the sketch and project framework and further algorithms that can be cast as special cases of the analysis, we refer the interested reader to Section 3 of [18] and Section 7 of [29]. Here we present only the inexact update rules of two special cases that we will later use in the numerical evaluation.
Special Cases: Let us denote by I_{:C} the column submatrix of the m × m identity matrix indexed by a random subset C of [m].
• Inexact Randomized Block Kaczmarz (iRBK): Let B = I and let us pick in each iteration the random matrix S = I_{:C} ∼ D. In this setup the update rule of iBasic simplifies to

x_{k+1} = x_k − ω A_{C:}^⊤ (A_{C:} A_{C:}^⊤)^† (A_{C:} x_k − b_C) + ε_k.   (21)

• Inexact Randomized Block Coordinate Descent (iRBCD)⁹: If the matrix A of the linear system is positive definite, then we can choose B = A. Let us also pick in each iteration the random matrix S = I_{:C} ∼ D. In this setup the update rule of iBasic simplifies to

x_{k+1} = x_k − ω I_{:C} (I_{:C}^⊤ A I_{:C})^† I_{:C}^⊤ (A x_k − b) + ε_k.   (22)

For more papers related to the Kaczmarz method (randomized, greedy, cyclic update rules) we refer the interested reader to [23,28,38,5,37,39,9,33,34,13,31,59,35,50]. For the coordinate descent method (a.k.a. Gauss-Seidel for linear systems) and its block variant, Randomized Block Coordinate Descent, we suggest [25,36,44,45,40,41,43,7,24,14,1,54].
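A hedged numpy sketch of both special cases on hypothetical data, with exact inner solves (ε_k = 0); replacing `np.linalg.solve` by an approximate iterative solver on the small block system yields the corresponding inexact variants:

```python
import numpy as np

rng = np.random.default_rng(5)

# iRBK setting: B = I, S = I_{:C}, consistent system (hypothetical data).
m, n = 40, 15
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)
x = np.zeros(n)
for k in range(2000):
    C = rng.choice(m, size=5, replace=False)        # random row block
    A_C, b_C = A[C], b[C]
    lam = np.linalg.solve(A_C @ A_C.T, b_C - A_C @ x)
    x = x + A_C.T @ lam                             # omega = 1 block Kaczmarz step
print(np.linalg.norm(A @ x - b) < 1e-6)             # consistent system solved

# iRBCD setting: A positive definite, B = A; only coordinates in C change.
d = 12
G = rng.standard_normal((d, d))
Apd = G @ G.T + d * np.eye(d)                       # well-conditioned SPD matrix
bpd = Apd @ rng.standard_normal(d)
z = np.zeros(d)
for k in range(1500):
    C = rng.choice(d, size=3, replace=False)
    z[C] += np.linalg.solve(Apd[np.ix_(C, C)], (bpd - Apd @ z)[C])
print(np.linalg.norm(Apd @ z - bpd) < 1e-6)
```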

Other Related Work on Inexact Methods
One of the current trends in large-scale optimization is the introduction of inexactness into the update rules of popular deterministic and stochastic methods. The rationale behind this is that an approximate/inexact step can often be computed very efficiently and can bring significant computational gains compared to its exact variant.
In the area of deterministic algorithms, the inexact variant of full gradient descent has received a lot of attention [49,11,51,16,32]. It has been analyzed for convex and strongly convex functions under several meaningful assumptions on the inexactness error ε_k, and its practical benefit compared to exact gradient descent is apparent. For further deterministic inexact methods, see [10] for Inexact Newton methods, [52,47] for Inexact Proximal Point methods and [3] for Inexact Fixed Point methods.
In the large-scale setting, stochastic optimization methods are preferred mainly because of their cheap per-iteration cost (compared to their deterministic variants), their ability to scale to extreme dimensions, and their improved theoretical complexity bounds. In areas like machine learning and data science, where datasets grow larger rapidly, the development of faster and more efficient stochastic algorithms is crucial. For this reason, inexactness has recently been introduced into the update rules of several stochastic optimization algorithms, and many new methods have been proposed and analyzed. Some of the most interesting work on inexact stochastic algorithms appears in the area of second-order methods, in particular on inexact variants of the Sketch-Newton method and the subsampled Newton method for minimizing convex and non-convex functions [48,2,4,56,57,58]. Note that our results are also related to this literature, since our algorithm can be seen as an inexact stochastic Newton method (see equation (5)). To the best of our knowledge, our work is the first to provide a convergence analysis of inexact stochastic proximal point methods (equation (6)) in any setting. From a numerical linear algebra viewpoint, inexact sketch and project methods for solving the best approximation problem and its dual have also never been analyzed before.
As we have already mentioned, our framework is quite general, and many algorithms, such as iRBK (21) and iRBCD (22), can be cast as special cases. As a result, our general convergence analysis includes the analysis of inexact variants of all of these more specific algorithms as special cases. In [34], an analysis of the exact randomized block Kaczmarz method was proposed, and in the experiments an inexact variant was used to speed up the method. However, no iteration complexity results were presented for the inexact variant, and both the analysis and the numerical evaluation were restricted to linear systems with full-rank matrices that come with a natural partition of the rows (a much more restricted case than the one analyzed in our setting). For inexact variants of the randomized block coordinate descent algorithm in settings different from ours, we suggest [53,15,6,12].
Finally, an analysis of approximate stochastic gradient descent for solving the empirical risk minimization problem using quadratic constraints and sequential semi-definite programs was presented in [22].

Convergence Results Under General Assumptions
In this section we consider scenarios in which the inexactness error ε_k can be controlled, by specifying a per-iteration bound σ_k on the norm of the error. In particular, by making different assumptions on the bound σ_k we derive general convergence rate results. Our focus is on the abstract notion of inexactness described in Section 2.1, and we make no assumptions on how this error is generated.
An important assumption that needs to hold in all of our results is exactness. A formal presentation is given below. We state it here and highlight that it is a requirement for all of our convergence results (exactness is also required in the analysis of the exact variants [46]).
Exactness. Note that f_S is a convex quadratic, and that f_S(x) = 0 whenever x ∈ L := {x : Ax = b}. However, f_S can be zero also for points x outside of L. Clearly, f(x) is nonnegative, and f(x) = 0 for x ∈ L. However, without further assumptions, the set of minimizers of f can be larger than L. The exactness assumption ensures that this does not happen. For necessary and sufficient conditions for exactness, we refer the reader to [46]. Here it suffices to remark that a sufficient condition for exactness is to require E[H] to be positive definite. This is easy to see by observing that f(x) = 1/2 (Ax − b)^⊤ E[H] (Ax − b), which for positive definite E[H] vanishes if and only if Ax = b. In other words, if X := arg min f(x) is the solution set of the stochastic optimization problem (1) and L = {x : Ax = b} is the solution set of the linear system (2), then the notion of exactness is captured by: X = L.
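The sufficient condition can be checked directly in a toy example: for uniform single-row sampling (S = e_i, B = I), H = e_i e_i^⊤/‖A_{i:}‖², so E[H] is diagonal and positive definite whenever every row of A is nonzero (hypothetical data below):

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical system; single-row sampling S = e_i gives H = e_i e_i^T / ||A_i:||^2.
m, n = 6, 4
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)

# E[H] for uniform single-row sampling: diagonal, positive definite since
# every row of a Gaussian A is nonzero almost surely.
EH = np.diag(1.0 / np.sum(A ** 2, axis=1)) / m
print(np.all(np.linalg.eigvalsh(EH) > 0))    # True: sufficient for exactness

# Consequently f(x) = 1/2 (Ax - b)^T E[H] (Ax - b) is positive off L = {x : Ax = b}.
f = lambda x: 0.5 * (A @ x - b) @ EH @ (A @ x - b)
x_off = rng.standard_normal(n)               # generic point with A x_off != b
print(f(x_off) > 0)                          # True
```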

Assumptions on Inexactness Error
In the convergence analysis of iBasic the following assumptions on the inexactness error are used. We note that Assumptions 1a, 1b and 1c are special cases of Assumption 1. Moreover, Assumption 2 is algorithm-dependent and can hold in addition to any of the other four assumptions. In our analysis, depending on the result we aim at, we will require either one of the first four assumptions to hold by itself, or to hold together with Assumption 2. We will always assume exactness.
In all assumptions, the expectation of the squared norm of the error (E[‖ε_k‖²]) is conditioned on the value of the current iterate x_k and the random matrix S_k. Moreover, it is worth mentioning that in the convergence analysis we never assume that the inexactness error has zero mean, that is, we do not require E[ε_k] = 0.
Assumption 1. E[‖ε_k‖²_B | x_k, S_k] ≤ σ_k², where the upper bound σ_k is a sequence of random variables (that can possibly depend on both the value of the current iterate x_k and the choice of the random matrix S_k at the k-th iteration).
The following three assumptions on the sequence of upper bounds are more restrictive; however, as we will later see, they allow us to obtain stronger and more controlled results.
Assumption 1a. E[‖ε_k‖²_B | x_k, S_k] ≤ σ_k², where the upper bound σ_k ∈ R is a (deterministic) sequence of real numbers.
Assumption 1b. E[‖ε_k‖²_B | x_k, S_k] ≤ σ_k² with σ_k = q ‖x_k − x*‖_B, where the upper bound is a special sequence that depends on a non-negative inexactness parameter q and the distance to the optimal solution. Assumption 1c. E[‖ε_k‖²_B | x_k, S_k] ≤ σ_k², where the upper bound is a special sequence that depends on a non-negative inexactness parameter q and the value of the stochastic function f_{S_k} computed at the iterate x_k.
Finally, the next assumption is more algorithmically oriented. It holds in cases where the inexactness error ε_k in the update rule is chosen to be orthogonal, with respect to the B-inner product, to the vector Π_{L_{S_k},B}(x_k) − x* = (I − ωB^{−1}Z_k)(x_k − x*). This statement may seem odd at this point, but its usefulness will become more apparent in the next section, where inexact algorithms with structured inexactness error will be analyzed. As it turns out, in the case of structured inexactness error (Algorithm 2) this assumption is satisfied.

Convergence Results
In this section we present the analysis of the convergence rates of iBasic under several combinations of the previously presented assumptions. All convergence results are described only in terms of convergence of the iterates x_k, that is, of ‖x_k − x*‖²_B, and not of the objective function values f(x_k). This is sufficient, because by f(x) ≤ (λ_max/2) ‖x − x*‖²_B (see Lemma 10) we can directly deduce a convergence rate for the function values. The exact Basic method (Algorithm 1 with ε_k = 0) has been analyzed in [46]. Our analysis of iBasic is more general and includes the convergence of the exact Basic method as a special case when we assume that the upper bound is σ_k = 0 for all k ≥ 0. For brevity, in the convergence results of this manuscript we also use ρ := 1 − ω(2 − ω)λ_min^+. Let us start by presenting the convergence of iBasic when only Assumption 1a holds for the inexactness error.

Theorem 1. Assume exactness and let {x_k}_{k=0}^∞ be the iterates produced by iBasic with ω ∈ (0, 2). Set x* = Π_{L,B}(x_0) and consider the error ε_k to be such that it satisfies Assumption 1a. Then,

E[‖x_k − x*‖_B] ≤ ρ^{k/2} ‖x_0 − x*‖_B + Σ_{i=0}^{k−1} ρ^{(k−1−i)/2} σ_i.   (28)

Proof. See Appendix B.1.

Corollary 1.
In the special case that the upper bound σ_k in Assumption 1a is fixed, that is, σ_k = σ for all k > 0, inequality (28) of Theorem 1 takes the following form:

E[‖x_k − x*‖_B] ≤ ρ^{k/2} ‖x_0 − x*‖_B + σ/(1 − √ρ).

This means that we obtain a linear convergence rate up to a solution level that is proportional to the upper bound σ¹⁰.
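This behavior can be observed numerically. A sketch with randomized Kaczmarz, a special case of iBasic with B = I and single-row sketches, on hypothetical data, injecting an artificial error of fixed norm σ per iteration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical full-column-rank consistent system (illustration only).
m, n = 40, 15
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star

def run(sigma, iters=10000, seed=7):
    """Randomized Kaczmarz with an artificial error of fixed norm sigma per step."""
    r = np.random.default_rng(seed)
    x = np.zeros(n)
    for _ in range(iters):
        i = r.integers(m)
        a = A[i]
        x = x + a * (b[i] - a @ x) / (a @ a)       # exact RK step
        if sigma > 0:
            e = r.standard_normal(n)
            x = x + sigma * e / np.linalg.norm(e)  # inexactness error, ||eps_k|| = sigma
    return np.linalg.norm(x - x_star)

err_exact = run(0.0)
err_inexact = run(1e-3)
print(err_exact < 1e-6)            # converges to x*
print(1e-4 < err_inexact < 1.0)    # stalls at a level proportional to sigma
```

The exact run drives the error to machine-precision levels, while the perturbed run plateaus at a noise floor governed by σ and the rate ρ, as Corollary 1 predicts.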
Inspired by [16], let us now analyze iBasic using the sequence of upper bounds described in Assumption 1b. This construction of the upper bounds allows us to obtain stronger and more controlled results. In particular, using the upper bound of Assumption 1b, the sequence of expected errors converges linearly to the exact solution x* (not merely to a neighborhood, as in the previous result). In addition, Assumption 1b guarantees that the distance to the optimal solution decreases as the number of iterations increases. However, this stronger convergence requires a bound on λ_min^+, a quantity that in many problems is unknown to the user or intractable to compute. Nevertheless, there are cases where this value has a closed-form expression and can be computed beforehand without any further cost. See for example [27,30,26,21], where methods for solving the average consensus problem were presented and the value of λ_min^+ corresponds to the algebraic connectivity of the network under study.
In Theorem 2, to guarantee linear convergence the inexactness parameter q should lie in the interval (0, 1 − √ρ). In particular, q is the parameter that controls the level of inexactness of Algorithm 1. Not surprisingly, the fastest convergence rate is obtained for q = 0; in this case the method becomes equivalent to its exact variant and the convergence rate simplifies to ρ = 1 − ω(2 − ω)λ_min^+. Note also that, as in the exact case, the optimal convergence rate is obtained for ω = 1 [46].
Moreover, the upper bound σ_k of Assumption 1b depends on two important quantities: λ_min^+ (through the upper bound on the inexactness parameter q) and the distance to the optimal solution ||x^k − x^*||_B. Thus, the inexactness error is allowed to be large either when the current iterate is far from the optimal solution (||x^k − x^*||_B^2 large) or when the problem is well conditioned and λ_min^+ is large. In the opposite scenario, when the problem is ill conditioned or we are already close to the optimum x^*, we should be more careful and allow smaller errors in the updates of the method.
In the next theorem we provide the complexity results of iBasic in the case that Assumption 2 is satisfied in combination with one of the previous assumptions.

Remark 1.
In the case that Assumptions 1a and 2 hold simultaneously, the convergence of iBasic is similar to (31), but in this case σ̄_i^2 = σ_i^2 for all i ∈ [k − 1] (due to Assumption 1a, σ_k ∈ R is a sequence of real numbers). In addition, note that for q ∈ (0, min{√ρ, 1 − √ρ}), having Assumption 2 on top of Assumption 1b leads to an improvement of the convergence rate. In particular, from Theorem 2, iBasic converges with rate (√ρ + q)^2 = ρ + q^2 + 2√ρ q, while under both assumptions this simplifies to the faster rate ρ + q^2 of (32).

iBasic with Structured Inexactness Error
Up to this point the analysis of iBasic focused on the more abstract setting in which the inexactness error ε^k of the update rule satisfies several general assumptions. In this section we focus on a more structured form of inexactness error, and we provide a convergence analysis for the case where a linearly convergent algorithm is used to compute the solution of the expensive key subproblem of the method.

Linear System in the Update Rule
As we already mentioned in Section 2.1, the update rule of the exact Basic method (Algorithm 1 with ε^k = 0) can be written as x^{k+1} = x^k + ω B^{−1} A^⊤ S_k M_k^† d_k, where M_k = S_k^⊤ A B^{−1} A^⊤ S_k and d_k = S_k^⊤(b − A x^k). Using this expression, the exact Basic method can be equivalently interpreted as the following two step procedure:

1. Find the least norm solution λ^k of the linear system M_k λ = d_k. (We look precisely for the least norm solution because it can be written down in a compact way using the Moore-Penrose pseudoinverse, λ^k = M_k^† d_k, which is the expression that appears in our update. However, it can easily be shown that the method converges with the same rate even if we choose any other solution of the linear system M_k λ = d_k.)

2. Compute the next iterate: x^{k+1} = x^k + ω B^{−1} A^⊤ S_k λ^k.

In the case that the random matrix S_k is large (this is the case we are interested in), solving the linear system M_k λ = d_k exactly in each step can be prohibitively expensive. To reduce this cost we allow the inner linear system M_k λ = d_k to be solved inexactly using an iterative method. In particular, we propose and analyze the following inexact algorithm:

Algorithm 2 iBasic with structured inexactness error
Input: Distribution D from which we draw random matrices S, positive definite matrix B ∈ R^{n×n}, stepsize ω > 0.
Initialize: x^0 ∈ R^n
1: for k = 1, 2, . . . do
2:   Generate a fresh sample S_k ∼ D
3:   Using an iterative method, compute an approximation λ_k^≈ of the least norm solution of the linear system
       M_k λ = d_k,   (34)
     where M_k = S_k^⊤ A B^{−1} A^⊤ S_k and d_k = S_k^⊤(b − A x^k)
4:   Set x^{k+1} = x^k + ω B^{−1} A^⊤ S_k λ_k^≈
5: end for

For the computation of the inexact solution of the linear system (34), any iterative method for solving general linear systems can be used. In our analysis we focus on linearly convergent methods. For example, based on the properties of the linear system (34), the conjugate gradient method (CG) or the sketch and project method (SPM) can be used for the execution of step 3. In these cases we name Algorithm 2 InexactCG and InexactSP, respectively.
It is known that classical CG can solve linear systems with positive definite matrices. In our approach the matrix M_k is positive definite only when the original linear system Ax = b has a full rank matrix A. On the other hand, SPM can solve any consistent linear system and as a result can solve the inner linear system M_k λ = d_k without any further assumption on the original linear system. In this case one should be careful because the system need not have a unique solution. We are interested in the least norm solution of M_k λ = d_k, which means that the starting point of the sketch and project method at the k-th iteration should always be λ_k^0 = 0. Recall that any special case of the sketch and project method (Section 2.3) solves the best approximation problem.
Let us now define λ_k^r to be the approximate solution λ_k^≈ of the q × q linear system (34) obtained after r steps of the linearly convergent iterative method. Using this, the update rule of Algorithm 2 takes the form:

x^{k+1} = x^k + ω B^{−1} A^⊤ S_k λ_k^r.   (35)

Remark 2. The update rule (35) of Algorithm 2 is equivalent to the update rule of iBasic (Algorithm 1) when the error ε^k is chosen to be ε^k = ω B^{−1} A^⊤ S_k (λ_k^r − λ_k^*). This is precisely the connection between the abstract and the more concrete/structured notion of inexactness that was first presented in Table 2.
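The structure of Algorithm 2 can be sketched in code. The following is an illustrative implementation (not the authors' code) with B = I, a hypothetical small consistent system, and a fixed number r of CG iterations for step 3; with B = I and a full rank A the inner matrix M_k is positive definite, as CG requires:

```python
# Sketch of Algorithm 2 ("InexactCG"), assuming B = I and a random full rank A,
# so that M_k = S^T A A^T S is positive definite. Step 3 runs only r CG
# iterations, started at lambda = 0, instead of solving M_k lam = d_k exactly.
import numpy as np

def cg(M, d, r):
    """Run r conjugate gradient iterations on M lam = d, starting from zero."""
    lam = np.zeros_like(d)
    res = d.copy()               # residual d - M lam at lam = 0
    p = res.copy()
    for _ in range(r):
        rs = res @ res
        if rs < 1e-300:          # inner system already solved to machine precision
            break
        Mp = M @ p
        alpha = rs / (p @ Mp)
        lam = lam + alpha * p
        res = res - alpha * Mp
        p = res + ((res @ res) / rs) * p
    return lam

rng = np.random.default_rng(1)
m, n = 30, 15
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)   # consistent system

x = np.zeros(n)
for k in range(300):
    S = np.eye(m)[:, rng.choice(m, size=6, replace=False)]  # fresh block sketch
    M = S.T @ A @ A.T @ S        # inner system matrix of (34)
    d = S.T @ (b - A @ x)
    lam = cg(M, d, r=3)          # step 3: inexact inner solve
    x = x + A.T @ S @ lam        # step 4: unit stepsize update

print(np.linalg.norm(A @ x - b))
```

Despite the inner solves being truncated after only r = 3 CG steps, the outer iterates still drive the residual toward zero, in line with the analysis of this section.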
Let us now state a lemma that is useful for the analysis of this section; it verifies that Algorithm 2 with unit stepsize satisfies the general Assumption 2 presented in Section 3.1.

Lemma 4.
Let us denote by x_k^* = Π_{L_{S_k},B}(x^k) the projection of x^k onto L_{S_k} in the B-norm, and let x^* = Π_{L,B}(x^0). Assume ω = 1 (unit stepsize). Then for the updates of Algorithm 2 it holds that:

⟨x_k^* − x^*, ε^k⟩_B = 0.   (37)

Figure 1: Graphical interpretation of the orthogonality that justifies equation (37). The two vectors x_k^* − x^* and ε^k are orthogonal with respect to the B-inner product. x^{k+1} is the point that Algorithm 2 computes in each step; the colored region represents Null(S_k^⊤ A); x_k^* = Π_{L_{S_k},B}(x^k), x^* = Π_{L,B}(x^0), and ε^k is the inexactness error.
Proof. Since the null space of an arbitrary matrix is the orthogonal complement of the range space of its transpose, Null(S_k^⊤ A) is orthogonal, with respect to the B-inner product, to Range(B^{−1} A^⊤ S_k). Noting that x_k^* − x^* ∈ Null(S_k^⊤ A) while ε^k ∈ Range(B^{−1} A^⊤ S_k) completes the proof (see Figure 1 for the graphical interpretation).
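The orthogonality argument above can be verified numerically. The sketch below (with B = I and a hypothetical small instance) builds one vector in Null(S^⊤A) and one in Range(A^⊤S) and checks that their inner product vanishes:

```python
# Numeric check (B = I) of the orthogonality behind Lemma 4: any vector in
# Null(S^T A) is orthogonal to any vector in Range(A^T S) = Range(B^{-1} A^T S).
import numpy as np

rng = np.random.default_rng(7)
m, n = 10, 6
A = rng.standard_normal((m, n))
S = np.eye(m)[:, [0, 2, 5]]              # a fixed block sketch, q = 3
T = S.T @ A                              # q x n

# Orthonormal basis of Null(S^T A): the last n - q right singular vectors of T.
Vt = np.linalg.svd(T)[2]
null_basis = Vt[3:].T                    # n x (n - q)

u = null_basis @ rng.standard_normal(3)  # a vector in Null(S^T A)
v = A.T @ (S @ rng.standard_normal(3))   # a vector in Range(A^T S)
print(abs(u @ v))                        # zero up to round-off
```

The inner product is zero up to floating point round-off, which is exactly the geometric fact depicted in Figure 1.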

Sketch and Project Interpretation
Let us now give a different interpretation of the inexact update rule of Algorithm 2 using the sketch and project approach. This will help us appreciate the importance of the dual viewpoint and make clear the connection between the primal and dual methods.
Recall that in the special case of unit stepsize (see equation (9)) the exact sketch and project method performs updates of the form:

x^{k+1} = argmin_x { ||x − x^k||_B : S_k^⊤ A x = S_k^⊤ b }.   (38)

That is, a sketched system S_k^⊤ A x = S_k^⊤ b is first chosen, and the next iterate is computed by projecting the current iterate x^k onto this system. In general, executing a projection step is one of the most common tasks in the numerical linear algebra/optimization literature. However, in the large scale setting even this task can be prohibitively expensive, and it is not obvious how to execute it inexactly in a controlled way. For this reason we suggest moving to the dual space, where the inexactness can be easily controlled.
Observe that the update rule (38) has the same structure as the best approximation problem (7), where the linear system under study is the sketched system S_k^⊤ A x = S_k^⊤ b and the starting point is the current iterate x^k. Hence we can easily compute its dual:

max_{λ ∈ R^q} D_k(λ) := (S_k^⊤(b − A x^k))^⊤ λ − (1/2) λ^⊤ S_k^⊤ A B^{−1} A^⊤ S_k λ,   (39)

where λ ∈ R^q is the dual variable. A maximizer λ_k^* of the dual problem at iteration k (there may be more than one) is a point satisfying ∇D_k(λ_k^*) = 0. By computing the derivative, this is equivalent to finding a λ that satisfies the linear system S_k^⊤ A B^{−1} A^⊤ S_k λ = S_k^⊤(b − A x^k). This is exactly the linear system we wish to solve inexactly in Algorithm 2. Thus, computing an inexact solution λ_k^≈ of this linear system is equivalent to computing an inexact solution of the dual problem (39). Then, using the affine mapping (13) that connects the primal and dual spaces, we can also evaluate an inexact solution of the original primal problem (38).
The following result relates the inexactness levels of these quantities. In particular, it shows that the dual suboptimality of λ_k^≈ in terms of dual function values equals half the squared distance between λ_k^≈ and λ_k^* in the M_k-norm.
Lemma 5. Let λ_k^* ∈ R^q be the exact solution of the linear system S_k^⊤ A B^{−1} A^⊤ S_k λ = S_k^⊤(b − A x^k), or equivalently of the dual problem (39), and let λ_k^≈ ∈ R^q denote an inexact solution. Then:

D_k(λ_k^*) − D_k(λ_k^≈) = (1/2) ||λ_k^≈ − λ_k^*||_{M_k}^2.

Proof.
where in the second equality we use equation (13) to connect the optimal solutions of (38) and (39).
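Lemma 5 can be checked numerically for the quadratic dual D_k(λ) = d_k^⊤λ − (1/2)λ^⊤M_kλ; this explicit quadratic form is an assumption here, chosen to be consistent with the optimality condition M_kλ = d_k stated above:

```python
# Numeric check of Lemma 5: for the concave quadratic D(lam) = d^T lam
# - (1/2) lam^T M lam, the suboptimality D(lam*) - D(lam) equals
# (1/2) ||lam - lam*||_M^2, where M lam* = d.
import numpy as np

rng = np.random.default_rng(2)
q = 6
G = rng.standard_normal((q, q))
M = G @ G.T                        # stands in for M_k = S^T A B^{-1} A^T S (PSD)
lam_star = rng.standard_normal(q)
d = M @ lam_star                   # consistent right hand side: lam_star is exact

def D(lam):
    return d @ lam - 0.5 * lam @ M @ lam

lam_approx = lam_star + 0.1 * rng.standard_normal(q)   # an inexact solution
gap = D(lam_star) - D(lam_approx)
half_dist = 0.5 * (lam_approx - lam_star) @ M @ (lam_approx - lam_star)
print(gap, half_dist)              # the two quantities coincide
```

The dual suboptimality and the half squared M-norm distance agree to machine precision, as the lemma asserts.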

Complexity Results
In this part we analyze the performance of Algorithm 2 when a linearly convergent iterative method is used to solve the linear system (34) in step 3 inexactly. We denote by λ_k^r the approximate solution of the linear system after running the iterative method for r steps. Before stating the main convergence result, let us present a lemma that summarizes some observations that hold in our setting.
Lemma 6. Let λ_k^* be the exact solution and λ_k^r the approximate solution of the linear system (34). Then λ_k^* = M_k^† d_k. Moreover, for the error ε^k = ω B^{−1} A^⊤ S_k (λ_k^r − λ_k^*) of Remark 2 it holds that ||ε^k||_B^2 = ω^2 ||λ_k^r − λ_k^*||_{M_k}^2.

Theorem 7. Assume that for the computation of the inexact solution of the linear system (34) in step 3 of Algorithm 2 a linearly convergent iterative method is used, in the sense that

E[ ||λ_k^r − λ_k^*||_{M_k}^2 ] ≤ ρ_{S_k}^r ||λ_k^0 − λ_k^*||_{M_k}^2,

where λ_k^0 = 0 for all k ≥ 0 and ρ_{S_k} ∈ (0, 1) for every choice of S_k ∼ D. (In the case that a deterministic iterative method is used, like CG, the bound ||λ_k^r − λ_k^*||_{M_k}^2 ≤ ρ_{S_k}^r ||λ_k^0 − λ_k^*||_{M_k}^2 holds deterministically, and hence also in expectation.) Let exactness hold and let {x^k}_{k=0}^∞ be the iterates produced by Algorithm 2 with unit stepsize (ω = 1). Set x^* = Π_{L,B}(x^0). Suppose further that there exists a scalar θ < 1 such that ρ_{S_k} ≤ θ with probability 1. Then Algorithm 2 converges linearly with:

E[ ||x^k − x^*||_B^2 ] ≤ (1 − (1 − θ^r) λ_min^+)^k ||x^0 − x^*||_B^2.

Proof. Theorem 7 can be interpreted as a corollary of the general Theorem 3(iii). Thus, it suffices to show that Algorithm 2 satisfies Assumptions 1c and 2. First, note that by Lemma 4, Assumption 2 holds. Moreover, combining Lemma 6 (with ω = 1) and the linear convergence of the inner method, E[ ||ε^k||_B^2 ] ≤ θ^r ||λ_k^0 − λ_k^*||_{M_k}^2 = θ^r ||x_k^* − x^k||_B^2, which means that Assumption 1c also holds with q = θ^{r/2} ∈ (0, 1). This completes the proof.
Having presented the main result of this section, let us now state some remarks that help in understanding the convergence rate of the last theorem.
Remark 3. By definition θ^r ∈ (0, 1), and as a result (1 − θ^r) λ_min^+ ≤ λ_min^+. This means that the method converges linearly, but always with a worse rate than its exact variant.
Remark 4. Assume that θ is fixed. Then, as the number of iterations of step 3 increases (r → ∞), we have (1 − θ^r) → 1, and the method behaves increasingly like its exact variant.
Remark 5. The quantity λ_min^+ depends only on the random matrices S ∼ D and on the positive definite matrix B, and is independent of the iterative process used in step 3. The iterative process of step 3 controls only the parameter θ of the convergence rate.
Remark 6. Suppose we run Algorithm 2 twice, with two different choices of the linearly convergent iterative method in step 3. Assume that the distribution D of the random matrices and the positive definite matrix B are the same for both instances, and that in step 3 the iterative method runs for r steps in both cases. If θ_1 < θ_2, then

1 − (1 − θ_1^r) λ_min^+ < 1 − (1 − θ_2^r) λ_min^+.

This means that, when θ is easily computable, we should always prefer the inexact method with the smaller θ.
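Remarks 3 to 6 can be illustrated with a few lines of arithmetic on the rate 1 − (1 − θ^r)λ_min^+ of Theorem 7, using hypothetical values of λ_min^+, θ and r:

```python
# Illustration of Remarks 3-6 with assumed values: the rate of Theorem 7
# improves (decreases) as theta decreases, is always worse than the exact
# rate 1 - lam_min, and approaches the exact rate as r grows.
lam_min = 0.2                # assumed value of lambda_min^+
r = 5                        # assumed number of inner iterations

def rate(theta, r=r):
    return 1 - (1 - theta ** r) * lam_min

exact_rate = 1 - lam_min
rate1, rate2 = rate(0.5), rate(0.9)   # theta_1 < theta_2 (Remark 6)
print(exact_rate, rate1, rate2)
```

As expected, the exact rate is the smallest, the method with the smaller θ is faster, and taking many inner steps (large r) recovers the exact rate in the limit.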
The convergence guarantee of Theorem 7 is quite general and holds for any linearly convergent method that can solve (34) inexactly. However, when the iterative method is known, more concrete results can be obtained. Below we specialize the result to the conjugate gradient method (CG) and the sketch and project method (SPM).
Convergence of InexactCG: CG is a deterministic iterative method for solving linear systems Ax = b with a symmetric positive definite matrix A ∈ R^{n×n} in a finite number of iterations. In particular, it can be shown to converge to the unique solution in at most n steps. The worst case behavior of CG is given by [55,17] (a sharper convergence rate of CG [55] for solving Ax = b can also be used):

||x^k − x^*||_A ≤ 2 ( (√κ(A) − 1) / (√κ(A) + 1) )^k ||x^0 − x^*||_A,   (43)

where x^k is the k-th iterate of the method and κ(A) is the condition number of the matrix A.
Having presented the convergence of CG for general linear systems, let us now return to our setting. We denote by λ_k^r ∈ R^q the approximate solution of the inner linear system (34) after r conjugate gradient steps. Thus, using (43), the quantity ρ_{S_k} of Theorem 7 corresponds to the CG rate for the matrix M_k, and by making the same assumptions as in the general Theorem 7 we require a scalar θ_CG < 1 such that ρ_{S_k} ≤ θ_CG with probability 1.

Convergence of InexactSP: In this setting we suggest running the sketch and project method (SPM) to solve the linear system (34) inexactly. This requires no assumptions on the structure of the original system Ax = b, and as a result we are able to solve more general problems than InexactCG (which requires the matrix M_k to be positive definite; this holds when A is a full rank matrix). As before, by making the same assumptions as in Theorem 7, the more specific convergence guarantee for InexactSP can be obtained. Now the quantity ρ_{S_k} denotes the convergence rate of the exact Basic method, itself expressible as a sketch and project method (20), when applied to the linear system (34), and θ_SP < 1 is a scalar such that ρ_{S_k} ≤ θ_SP with probability 1.
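The classical CG bound (43) can be sanity-checked numerically. The sketch below (hypothetical well-conditioned instance) runs a plain CG implementation and verifies that the A-norm error never exceeds the bound:

```python
# Sanity check of the CG worst case bound (43): for a symmetric positive
# definite system, the A-norm error after k steps is at most
# 2 * ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))**k times the initial A-norm error.
import numpy as np

rng = np.random.default_rng(3)
n = 40
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)       # symmetric positive definite, modest kappa
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
kappa = np.linalg.cond(A)

def a_norm(e):
    return np.sqrt(e @ A @ e)

x = np.zeros(n)
res = b.copy()                    # residual b - A x at x = 0
p = res.copy()
e0 = a_norm(x - x_star)
ok = True
for k in range(1, 15):
    Ap = A @ p
    alpha = (res @ res) / (p @ Ap)
    x = x + alpha * p
    new_res = res - alpha * Ap
    beta = (new_res @ new_res) / (res @ res)
    p = new_res + beta * p
    res = new_res
    bound = 2 * ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** k * e0
    ok = ok and (a_norm(x - x_star) <= bound + 1e-12)

print(kappa, ok)
```

In practice CG is usually much faster than this worst-case envelope, which is why even a few inner CG steps suffice in InexactCG.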

Inexact Dual Method
In the previous sections we focused on the analysis of inexact stochastic methods for solving the stochastic optimization problem (1) and the best approximation problem (7). In this section we turn to the dual of the best approximation problem (10), and we propose and analyze an inexact variant of SDSA (11). We call the new method iSDSA; it is formalized as Algorithm 3. In the update rule, ε_d^k denotes the dual inexactness error appearing in the k-th iteration of iSDSA.

3:   Set y^{k+1} = y^k + ω S_k (S_k^⊤ A B^{−1} A^⊤ S_k)^† S_k^⊤ (b − A(x^0 + B^{−1} A^⊤ y^k)) + ε_d^k
4: end for

Correspondence Between the Primal and Dual Methods
To the sequence of dual iterates {y^k}_{k=0}^∞ produced by iSDSA we can associate a sequence of primal iterates {x^k}_{k=0}^∞ via the affine mapping (13). In our first result we show that the random iterates produced by iBasic arise as affine images of the iSDSA iterates under this mapping.
Theorem 8. (Correspondence between the primal and dual methods) Let {x^k}_{k=0}^∞ be the iterates produced by iBasic (Algorithm 1). Let y^0 = 0 and let {y^k}_{k=0}^∞ be the iterates of iSDSA. Assume that the two methods use the same stepsize ω > 0 and the same sequence of random matrices S_k. Assume moreover that ε^k = B^{−1} A^⊤ ε_d^k, where ε^k and ε_d^k are the inexactness errors appearing in the update rules of iBasic and iSDSA, respectively. Then

x^k = φ(y^k) = x^0 + B^{−1} A^⊤ y^k

for all k ≥ 0. That is, the primal iterates arise as affine images of the dual iterates.
Proof. Applying the affine mapping φ(y) = x^0 + B^{−1} A^⊤ y to the update rule of iSDSA maps each dual update to the corresponding primal update with error B^{−1} A^⊤ ε_d^k.
Thus, by choosing the inexactness error of the primal method to be ε^k = B^{−1} A^⊤ ε_d^k, the sequence of vectors {φ(y^k)} satisfies the same recursion as the sequence {x^k} defined by iBasic. It remains to check that the first elements of the two recursions coincide. Indeed, since y^0 = 0, we have x^0 = φ(0) = φ(y^0).
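The correspondence of Theorem 8 can be observed numerically in the exact case ε^k = 0. The sketch below assumes B = I and the affine mapping φ(y) = x^0 + B^{−1}A^⊤y, and runs the primal (Basic) and dual (SDSA) updates with shared sketches:

```python
# Numeric check of the primal-dual correspondence (Theorem 8), exact case,
# B = I: the primal iterates equal phi(y) = x0 + A^T y when both methods use
# the same random sketches S_k and the same lambda in each step.
import numpy as np

rng = np.random.default_rng(4)
m, n = 15, 8
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)                     # consistent system

x0 = rng.standard_normal(n)
x = x0.copy()
y = np.zeros(m)
for k in range(20):
    S = np.eye(m)[:, rng.choice(m, size=4, replace=False)]  # shared sketch
    M = S.T @ A @ A.T @ S
    lam = np.linalg.pinv(M) @ (S.T @ (b - A @ x))  # same lambda drives both
    x = x + A.T @ S @ lam                          # primal (Basic) update
    y = y + S @ lam                                # dual (SDSA) update

deviation = np.linalg.norm(x - (x0 + A.T @ y))
print(deviation)                                   # phi(y^k) equals x^k
```

The deviation stays at round-off level throughout, which is exactly the affine-image statement of the theorem.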

iSDSA with Structured Inexactness Error
In this subsection we present Algorithm 4. It can be seen as a special case of iSDSA with a more structured inexactness error.

3:   Using an iterative method, compute an approximation λ_k^≈ of the least norm solution of the linear system
       S_k^⊤ A B^{−1} A^⊤ S_k λ = S_k^⊤ (b − A(x^0 + B^{−1} A^⊤ y^k))
4:   Set y^{k+1} = y^k + ω S_k λ_k^≈
5: end for

Similarly to its primal variant, it can easily be checked that Algorithm 4 is a special case of iSDSA (Algorithm 3) with the dual inexactness error chosen to be ε_d^k = ω S_k (λ_k^r − λ_k^*). Note that, using the observation of Remark 2 that ε^k = ω B^{−1} A^⊤ S_k (λ_k^r − λ_k^*), together with the above expression for ε_d^k, we can easily verify that ε^k = B^{−1} A^⊤ ε_d^k holds. This is precisely the connection between the primal and dual inexactness errors that has already been used in the proof of Theorem 8.

Convergence of Dual Function Values
We are now ready to state a linear convergence result describing the behavior of the inexact dual method in terms of the dual function values D(y^k). The following result concerns the convergence of iSDSA under an assumption similar to Assumption 1b; similar convergence results can be obtained using any other assumption of Section 3.1. The convergence of Algorithm 4 can also easily be derived using arguments similar to those presented in Section 4 and the convergence guarantees of Theorem 7.
Theorem 9. (Convergence of dual objective) Assume exactness. Let y^0 = 0 and let {y^k}_{k=0}^∞ be the dual iterates of iSDSA (Algorithm 3) with ω ∈ (0, 2). Set x^* = Π_{L,B}(x^0) and let y^* be any dual optimal solution. Suppose the inexactness error ε_d^k is such that the induced primal error ε^k = B^{−1} A^⊤ ε_d^k satisfies Assumption 1b. Then

E[D(y^*) − D(y^k)] ≤ (√ρ + q)^{2k} [D(y^*) − D(y^0)].

Proof. The proof follows by applying Theorem 2 together with Theorem 8 and the identity (1/2) ||x^k − x^*||_B^2 = D(y^*) − D(y^k).
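The identity used in the proof of Theorem 9 can be checked numerically. The sketch below assumes B = I and the quadratic dual objective D(y) = (b − Ax^0)^⊤y − (1/2) y^⊤AB^{−1}A^⊤y (this explicit form is an assumption consistent with the affine mapping φ(y) = x^0 + A^⊤y):

```python
# Numeric check (B = I) of the identity D(y*) - D(y) = (1/2) ||x - x*||^2
# for x = phi(y) = x0 + A^T y, with D(y) = (b - A x0)^T y - (1/2) y^T A A^T y.
import numpy as np

rng = np.random.default_rng(6)
m, n = 12, 7
A = rng.standard_normal((m, n))
x0 = rng.standard_normal(n)
b = A @ rng.standard_normal(n)          # consistent system

def D(y):
    return (b - A @ x0) @ y - 0.5 * (y @ A @ (A.T @ y))

y_star = np.linalg.pinv(A @ A.T) @ (b - A @ x0)   # a dual optimal point
x_star = x0 + A.T @ y_star              # primal optimum via the affine map

y = rng.standard_normal(m)              # an arbitrary dual iterate
x = x0 + A.T @ y

gap_dual = D(y_star) - D(y)
half_sq = 0.5 * np.linalg.norm(x - x_star) ** 2
print(gap_dual, half_sq)                # the two quantities coincide
```

Dual suboptimality thus translates one-to-one into primal distance to the optimum, which is why Theorem 2 immediately yields Theorem 9.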

Numerical Evaluation
In this section we perform preliminary numerical tests studying the computational behavior of iBasic with structured inexactness error when it is used to solve the best approximation problem (7), or equivalently the stochastic optimization problem (1). As we have already mentioned, iBasic can be interpreted as a sketch-and-project method, and as a result a comprehensive array of well-known algorithms can be recovered as special cases by varying the main parameters of the method (Section 2.3). In particular, in our experiments we focus on the evaluation of two popular special cases: the inexact randomized block Kaczmarz method (iRBK) (equation (21)) and the inexact randomized block coordinate descent method (iRBCD) (equation (22)). We implement Algorithm 2 presented in Section 4 using CG to solve the linear system of the update rule (equation (34)) inexactly.
Recall that in this case we named the method InexactCG.
The convergence analysis of the previous sections is quite general and holds for several combinations of the two main parameters of the method: the positive definite matrix B and the distribution D of the random matrices S. To obtain iRBK as a special case we choose B = I ∈ R^{n×n} (the identity matrix), while for iRBCD the given matrix A should be positive definite and we choose B = A. For both methods the distribution D should be over random matrices S = I_{:C}, where I_{:C} denotes the column submatrix of the m × m identity matrix indexed by a random subset C of [m].
In our experiments we use one specific distribution over these matrices. In particular, in each iteration the random matrix is chosen as S = I_{:C}, with the subset C of [m] chosen uniformly at random among the subsets of a fixed, pre-specified cardinality d.
The code for all experiments is written in the Julia 0.6.3 programming language and run on a Mac laptop computer (OS X El Capitan), 2.7 GHz Intel Core i5 with 8 GB of RAM.
To match the theoretical convergence results of Algorithm 2, the relaxation parameter (stepsize) of the methods studied in our experiments is chosen to be ω = 1 (no relaxation). In all implementations we use x^0 = 0 ∈ R^n as the initial point, and in comparing the methods with their inexact variants we use the relative error measure ||x^k − x^*||_B^2 / ||x^0 − x^*||_B^2. We run each method (exact and inexact) until the relative error drops below 10^{−5}. For the horizontal axis we use either the number of iterations or the wall-clock time measured using the tic-toc Julia function. In the exact variants, the linear system (34) in Algorithm 2 needs to be solved exactly. In our experiments we follow the implementation of [18] for both exact RBCD and exact RBK, where the built-in direct solver (sometimes referred to as "backslash") is used.
Experimental setup: For the construction of consistent linear systems Ax = b we use the following setup: • For iRBK: Let a matrix A ∈ R^{m×n} be given (it can be either synthetic or real data). A vector z ∈ R^n is drawn with i.i.d. N(0, 1) entries, and the right hand side of the linear system is set to b = Az. In this way, the consistency of the linear system with matrix A and right hand side b is ensured.
• For iRBCD: A Gaussian matrix P ∈ R^{m×n} is generated and the matrix A = P^⊤P ∈ R^{n×n} is used in the linear system (in this way the matrix A is positive definite with probability 1). A vector z ∈ R^n is drawn with i.i.d. N(0, 1) entries and, again to ensure consistency of the linear system, the right hand side is set to b = Az.
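The setup above can be sketched as follows (small hypothetical sizes; the paper's experiments use Julia and much larger matrices):

```python
# Sketch of the experimental setup: consistent systems b = A z for iRBK, a
# positive definite A = P^T P for iRBCD, and the block sampling S = I_{:C}.
import numpy as np

rng = np.random.default_rng(5)

# iRBK setup: any matrix A, right hand side built from a Gaussian z.
m, n = 50, 20
A_rbk = rng.standard_normal((m, n))
z = rng.standard_normal(n)
b_rbk = A_rbk @ z                 # consistency holds by construction

# iRBCD setup: A = P^T P is positive definite with probability 1.
P = rng.standard_normal((30, 10))
A_rbcd = P.T @ P
b_rbcd = A_rbcd @ rng.standard_normal(10)

# Block sampling: S = I_{:C} for a uniformly random subset C of fixed size d.
d = 5
C = rng.choice(m, size=d, replace=False)
S = np.eye(m)[:, C]

print(np.linalg.norm(A_rbk @ z - b_rbk), S.shape)
```

The same construction, scaled up, underlies all the experiments reported below.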

Importance of Large Block Size
Many recent works have shown that using larger block sizes can be very beneficial for the performance of randomized iterative algorithms [18,44,34,27]. In Figure 2 we numerically verify this claim. Recall that this is precisely the class of algorithms that can have an expensive subproblem in their update rule, required to be solved exactly, and as a result can benefit the most from the introduction of inexactness.

Inexactness and Block Size (iRBCD)
In this experiment we first construct a positive definite linear system following the procedure described previously for iRBCD. We generate a Gaussian matrix P ∈ R^{10000×7000} and then the positive definite matrix A = P^⊤P ∈ R^{7000×7000} is used to define a consistent linear system. We run iRBCD on this specific linear system and compare its performance with its exact variant for several block sizes d (numbers of columns of the matrix S). For evaluating the inexact solution of the linear system in the update rule, we run CG for either 2, 5 or 10 iterations. In Figure 3 we plot the evolution of the relative error in terms of both the number of iterations and the wall-clock time.
We observe that for any block size the inexact methods are always faster in terms of wall-clock time than their exact variants, even though they require (as expected) an equal or larger number of iterations. Moreover, the advantage of the inexact method over its exact variant grows as the size d increases and, with it, the cost of the sub-problem that needs to be solved in each step. It is worth highlighting that, for the chosen systems, the exact RBCD itself behaves better in terms of wall-clock time as the block size increases (this coincides with the findings of the previous experiment).

Evaluation of iRBK
In the last experiment we evaluate the performance of iRBK on both synthetic and real datasets. For computing the inexact solution of the linear system in the update rule, we run CG for a pre-specified number of iterations that can vary depending on the dataset. In particular, we compare iRBK and RBK on synthetic linear systems generated with the Julia Gaussian matrix functions "randn(m,n)" and "sprandn(m,n,r)" (the input r of the sprandn function indicates the density of the matrix). For the real datasets, we test the performance of iRBK and RBK using real matrices from the library of support vector machine problems LIBSVM [8]. Each LIBSVM dataset consists of a matrix A ∈ R^{m×n} (m data points, n features) and a vector of labels b ∈ R^m. In our experiments we use only the matrices of the datasets and ignore the label vectors. As before, to ensure consistency of the linear system, we choose a Gaussian vector z ∈ R^n and set the right hand side to b = Az (for both the synthetic and the real matrices). Figure 4 makes clear that for all problems under study the performance of iRBK in terms of wall-clock time is much better than that of its exact variant RBK.

Conclusion
In this work we propose and analyze inexact variants of several stochastic algorithms for solving quadratic optimization problems and linear systems. We provide linear convergence rates under several assumptions on the inexactness error. The proposed methods require more iterations than their exact variants to achieve the same accuracy; however, as we show through our numerical evaluation, the inexact algorithms require significantly less time to converge.
With the continuously increasing size of datasets, inexactness should definitely be a tool that practitioners use in their implementations, even in the case of stochastic methods whose iterations are already much cheaper to compute than those of their deterministic variants. Recently, accelerated and parallel stochastic optimization methods [29,46,54] have been proposed for solving linear systems. We speculate that the addition of inexactness to their update rules will lead to methods that are faster in practice. We also believe that our approach and complexity results can be extended to the more general problem of minimizing convex and non-convex functions in the stochastic setting. Finally, sketch-and-project algorithms have been used for solving the average consensus problem [27,20], popular in the distributed optimization literature. Our results could also be useful in this area and lead to the development of novel randomized gossip algorithms that use inexactness in their update rules.

B Proofs of Main Results
In our convergence analysis we use several popular inequalities; see Table 3 in Appendix C for the abbreviations and the relevant formulas. A key step in the proofs of the theorems is the tower property of expectation, which we use in the form (49). Now let us find a bound on the quantity E[||(I − ωB^{−1}Z_k)(x^k − x^*)||_B | x^k], using (48) and (47) and applying Remark 7 in the last inequality, to obtain (57). By substituting (57) into (56) we obtain (58), and taking the final expectation (the outermost expectation in the tower rule (49)) on this expression yields (59). Using r_k = E[||x^k − x^*||_B^2], equation (59) takes the form of a recurrence in r_k. If we further substitute p_k = √r_k and ε = √ρ, the recurrence simplifies to p_{k+1} ≤ ε p_k + σ_k. By unrolling this final inequality:

p_k ≤ ε^k p_0 + (ε^0 σ_{k−1} + ε^1 σ_{k−2} + · · · + ε^{k−1} σ_0) = ε^k p_0 + Σ_{i=0}^{k−1} ε^{k−1−i} σ_i.

The result is obtained by using V.I in the last expression.

B.2 Proof of Corollary 1
By denoting r_k = E[||x^k − x^*||_B], inequality (28) gives:

r_k ≤ ρ^{k/2} r_0 + σ Σ_{i=0}^{k−1} ρ^{i/2} = ρ^{k/2} r_0 + σ (1 − ρ^{k/2}) / (1 − √ρ).

Since 1 − ρ^{k/2} ≤ 1, the result follows.

B.3 Proof of Theorem 2
In order to prove Theorem 2 we follow steps similar to those of the proof of Theorem 1. The main differences between the two proofs appear at the points where we need to upper bound the norm of the inexactness error (||ε^k||^2). In particular, instead of using the general sequence σ_k^2 ∈ R, we utilize the bound q^2 ||x^k − x^*||_B^2 of Assumption 1b. Similarly to the proof of Theorem 1, we first decompose the update to obtain equation (51). There, the expression T1 can be upper bounded by (53), but now, using Assumption 1b, the expressions T2 and T3 can be upper bounded as in (60) (using Remark 7 and (54)) and (61). As a result, by substituting the bounds (53), (60) and (61) into (51), and following the same steps as in the proof of Theorem 1, equation (58) takes a corresponding recursive form. We take the final expectation (the outermost expectation in the tower rule (49)) on this expression, and the final result follows by unrolling the recurrence.

B.4 Proof of Theorem 3
Proof. Similarly to the previous two proofs, by decomposing the update rule and using the innermost expectation of (49) we obtain equation (51). An upper bound for the expression T1 is again given by inequality (53). For the expression T2, different upper bounds can be used depending on which assumption is imposed on the norm of the inexactness error: (i) the bound of Assumption 1a; (ii) the bound of Assumption 1b; (iii) the bound of Assumption 1c. The main difference from the previous proofs is that, due to Assumption 2 and the tower property (49), the expression T3 is equal to zero. Thus, in this case equation (55) simplifies accordingly, and, depending on the assumption in force, taking the middle expectation (see (49)) and applying it to the resulting inequality yields the corresponding results.

Table 3: Popular inequalities with abbreviations and formulas (for example, the conditional Jensen inequality for a random variable X).
The Basics
A, b : m × n matrix and m × 1 vector defining the system Ax = b
L : {x : Ax = b} (solution set of the linear system)
B : n × n symmetric positive definite matrix
⟨x, y⟩_B : x^⊤ B y (B-inner product)
||x||_B : √⟨x, x⟩_B (B-norm)