STOCHASTIC PRIMAL-DUAL HYBRID GRADIENT ALGORITHM WITH ARBITRARY SAMPLING AND IMAGING APPLICATIONS

Abstract. We propose a stochastic extension of the primal-dual hybrid gradient algorithm studied by Chambolle and Pock in 2011 to solve saddle point problems that are separable in the dual variable. The analysis is carried out for general convex-concave saddle point problems, and for problems that are either partially smooth / strongly convex or fully smooth / strongly convex. We perform the analysis for arbitrary samplings of dual variables, and obtain known deterministic results as a special case. Several variants of our stochastic method significantly outperform the deterministic variant on a variety of imaging tasks.

1. Introduction. Many modern problems in a variety of disciplines (imaging, machine learning, statistics, etc.) can be formulated as convex optimization problems. Instead of solving the optimization problem directly, it is often advantageous to reformulate it as a saddle point problem. A very popular algorithm to solve such saddle point problems is the primal-dual hybrid gradient (PDHG) algorithm [37, 21, 13, 36, 14, 15]. It has been used to solve a vast number of state-of-the-art problems; to name a few examples in imaging: image denoising with the structure tensor [22], total generalized variation denoising [11], dynamic regularization [7], multi-modal medical imaging [27], multi-spectral medical imaging [43], computation of non-linear eigenfunctions [26], and regularization with directional total generalized variation [29]. Its popularity stems from two facts. First, it is very simple and therefore easy to implement. Second, it involves only simple operations like matrix-vector multiplications and evaluations of proximal operators, which for many problems of interest are simple and in closed form or easy to compute iteratively, cf. e.g. [33]. However, for the large problems encountered in many real-world applications, even these simple operations may be too costly to perform frequently.
We propose a stochastic extension of the PDHG for saddle point problems that are separable in the dual variable (cf. e.g. [18, 50, 52, 34]), where not all but only a few of these operations are performed in every iteration. Moreover, as in incremental optimization algorithms [47, 31, 10, 9, 8, 45, 19], over the course of the iterations we continuously build up information from previous iterations, which reduces variance and thereby the negative effects of stochasticity. Non-uniform samplings [40, 38, 50, 39, 2] have proven very efficient for stochastic optimization. In this work we use the expected separable overapproximation framework of [38, 39, 41] to prove all statements for all non-trivial and iteration-independent samplings.
Related Work. The proposed algorithm can be seen as a generalization of the algorithm of [18, 52, 50] to arbitrary blocks and a much wider class of samplings. Moreover, in contrast to their results, our results generalize the deterministic case considered in [37, 13, 36, 15]. Fercoq and Bianchi [23] proposed a stochastic primal-dual algorithm with explicit gradient steps that allows for larger step sizes by averaging over previous iterates; however, this comes at the cost of prohibitively large memory requirements. Similar memory issues are encountered by a primal-dual algorithm of [4]. It is related to forward-backward splitting [30] and averaged gradient descent [10, 20] and therefore suffers the same memory issues as averaged gradient descent. Moreover, Valkonen proposed a stochastic primal-dual algorithm that can exploit partial strong convexity of the saddle point functional [48]. Randomized versions of the alternating direction method of multipliers are discussed, for instance, in [51, 25]. In contrast to other works on stochastic primal-dual algorithms [35, 49], our analysis is not based on Fejér monotonicity [16]. We therefore do not prove almost sure convergence of the sequence, but instead prove a variety of convergence rates depending on strong convexity assumptions. For smooth / strongly convex problems, our analysis implies that the variance of the algorithm converges to zero, which we will show empirically to be a distinguishing factor between our work and [35].
As a word of warning, our contribution should not be mistaken for other "stochastic" primal-dual algorithms, where errors in the computation of matrix-vector products and evaluations of proximal operators are modeled by random variables, cf. e.g. [35, 16, 44]. In our work we deliberately choose to compute only a subset of a whole iteration to save computational cost. These two notions are related but are certainly not the same.
1.1. Contributions. We briefly mention the main contributions of our work.
Generalization of Deterministic Case. The proposed stochastic algorithm is a direct generalization of the deterministic setting [37, 13, 36, 14, 15]. In the degenerate case where all computations are performed in every iteration, our algorithm coincides with the original deterministic algorithm. Moreover, the same holds true for our analysis of the stochastic algorithm, where we recover almost all deterministic statements (boundedness, convergence rates, etc.) [13, 36] in this degenerate case. Therefore, the theorems for both the deterministic and the stochastic case can be covered by a single proof.
Better Rates. Our analysis extends the simple setting of [50] such that the strong convexity assumptions and the sampling do not have to be uniform. In the special case of uniform strong convexity and uniform sampling, the proven convergence rates are better than the ones proven in [50].
Arbitrary Sampling. The proposed algorithm is guaranteed to converge under a very general class of samplings [38, 39, 41] and thereby also generalizes the algorithm of [50], which has only been analyzed for two specific samplings. As long as the sampling is independent and identically distributed over the iterations and all computations have non-zero probability to be carried out, the theory holds and the algorithm converges with the proven convergence rates.
Acceleration. We propose an acceleration of the stochastic primal-dual algorithm which accelerates the convergence from O(1/K) to O(1/K^2) if parts of the saddle point functional are strongly convex, and thereby results in a significantly faster algorithm.
Scaling Invariance.In the strongly convex case, we propose parameters for several serial samplings (uniform, importance, optimal), all based on the condition numbers of the problem and thereby independent of scaling.
2. General Problem. Let X, Y_i, i = 1, ..., n, be real Hilbert spaces of any dimension and define the product space Y := Y_1 × ... × Y_n. For y ∈ Y, we shall write y = (y_1, y_2, ..., y_n), where y_i ∈ Y_i. Further, we consider the natural inner product on the product space Y given by ⟨y, z⟩ = Σ_{i=1}^n ⟨y_i, z_i⟩, where y_i, z_i ∈ Y_i. This inner product induces the norm ‖y‖² = Σ_{i=1}^n ‖y_i‖². Let A : X → Y be a bounded linear operator. Due to the product space nature of Y, we have (Ax)_i = A_i x, where A_i : X → Y_i are linear operators. The adjoint of A is given by A*y = Σ_{i=1}^n A_i* y_i.

Given the setup described above, we consider the optimization problem

(1)    min_{x ∈ X} f(Ax) + g(x),    f(y) = Σ_{i=1}^n f_i(y_i).

Instead of solving (1) directly, it is often desirable to reformulate the problem as a saddle point problem with the help of the Fenchel conjugate. If f is proper, convex, and lower semi-continuous, then f(y) = f**(y) = sup_{z ∈ Y} ⟨z, y⟩ − f*(z), where f* : Y → ℝ ∪ {−∞, +∞}, f*(y) = Σ_{i=1}^n f_i*(y_i), is the Fenchel conjugate of f (and f** its biconjugate, defined as the conjugate of the conjugate). Then solving (1) is equivalent to finding the primal part x⋆ of a solution to the saddle point problem (called a saddle point)

(2)    min_{x ∈ X} max_{y ∈ Y} ⟨Ax, y⟩ + g(x) − f*(y).

We will assume that the saddle point problem (2) has a solution. For conditions for existence and uniqueness, we refer the reader to [5]. A saddle point (x⋆, y⋆) = (x⋆, y_1⋆, ..., y_n⋆) satisfies the optimality conditions A_i x⋆ ∈ ∂f_i*(y_i⋆), i = 1, ..., n, and −A*y⋆ ∈ ∂g(x⋆).

Recall that a function h is µ-convex if h − (µ/2)‖·‖² is convex. In general, we assume that g is µ_g-convex and f_i* is µ_i-convex with nonnegative strong convexity parameters µ_g, µ_i ≥ 0. The convergence results in this contribution cover three different cases of regularity: i) no strong convexity, µ_g = µ_i = 0; ii) semi-strong convexity, µ_g > 0 or µ_i > 0; and iii) full strong convexity, µ_g, µ_i > 0. For notational convenience we make use of the operators M_i := µ_i I, M_g := µ_g I and M_f := diag(M_1, ..., M_n).
A very popular algorithm to solve the saddle point problem (2) is the Primal-Dual Hybrid Gradient (PDHG) algorithm [37, 21, 13, 36, 14, 15]. It reads (with extrapolation on y)

(3a)    x^{k+1} = prox_{τg}(x^k − τ A* ȳ^k),
(3b)    y^{k+1} = prox_{σf*}(y^k + σ A x^{k+1}),
(3c)    ȳ^{k+1} = y^{k+1} + θ (y^{k+1} − y^k),

where the proximal operator (or proximity / resolvent operator) is defined as

prox_{τg}(x) := arg min_{z ∈ X} { (1/2) ‖z − x‖²_{τ^{−1}} + g(z) }

and the weighted norm by ‖x‖²_{τ^{−1}} = ⟨τ^{−1} x, x⟩. Its convergence is guaranteed if the step size parameters σ, τ are positive and satisfy στ‖A‖² < 1 and θ = 1 [13]. Note that the definition of the proximal operator is well-defined for an operator-valued step size τ. In the case of a separable function f and with operator-valued step sizes, the PDHG algorithm takes the form

x^{k+1} = prox_{Tg}(x^k − T A* ȳ^k),
y_i^{k+1} = prox_{S_i f_i*}(y_i^k + S_i A_i x^{k+1}),    i = 1, ..., n,
ȳ^{k+1} = y^{k+1} + θ (y^{k+1} − y^k).

Here the step size parameters S = diag(S_1, ..., S_n) (a block diagonal operator), S_1, ..., S_n and T are symmetric and positive definite. The algorithm is guaranteed to converge if ‖S^{1/2} A T^{1/2}‖ < 1 and θ = 1 [36].
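As an illustration, the PDHG iteration above can be sketched in a few lines of NumPy. This is our own sketch, not the paper's code: the quadratic test problem, the step size choice σ = τ = 0.99/‖A‖ (so that στ‖A‖² < 1), and all function names are assumptions for the example.

```python
import numpy as np

def pdhg(A, prox_tau_g, prox_sigma_fstar, x0, y0, n_iter=100, theta=1.0):
    """Sketch of the deterministic PDHG iteration with extrapolation on y.

    A                : (m, d) matrix standing in for the linear operator
    prox_tau_g       : callable (x, tau)   -> prox_{tau g}(x)
    prox_sigma_fstar : callable (y, sigma) -> prox_{sigma f*}(y)
    """
    norm_A = np.linalg.norm(A, 2)        # operator (spectral) norm
    sigma = tau = 0.99 / norm_A          # sigma * tau * ||A||^2 = 0.99^2 < 1
    x, y = x0.copy(), y0.copy()
    y_bar = y.copy()
    for _ in range(n_iter):
        x = prox_tau_g(x - tau * (A.T @ y_bar), tau)
        y_new = prox_sigma_fstar(y + sigma * (A @ x), sigma)
        y_bar = y_new + theta * (y_new - y)   # extrapolation on the dual variable
        y = y_new
    return x, y
```

For instance, for min_x (1/2)‖Ax − b‖² + (λ/2)‖x‖² one has prox_{τg}(x) = x/(1 + τλ) and prox_{σf*}(y) = (y − σb)/(1 + σ), and the iterates approach the solution of the normal equations (AᵀA + λI)x = Aᵀb.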

3. Algorithm.
In this work we extend the PDHG algorithm to a stochastic setting where in each iteration we update only a random subset S of the dual variables (3b). This subset is sampled in an i.i.d. fashion from a fixed but otherwise arbitrary distribution, whence the name "arbitrary sampling". In order to guarantee convergence, it is necessary to assume that the sampling is "proper" [42, 39]. A sampling is proper if for each dual variable i we have i ∈ S with positive probability p_i > 0. Examples of proper samplings include the full sampling, where S = {1, ..., n} with probability 1, and serial sampling, where S = {i} is chosen with probability p_i. It is important to note that other samplings are also admissible. For instance, for n = 3, consider the sampling that selects S = {1, 2} with probability 1/3 and S = {2, 3} with probability 2/3. Then the probabilities for the three blocks are p_1 = 1/3, p_2 = 1 and p_3 = 2/3, which makes it a proper sampling. However, if only S = {1, 2} is chosen with probability 1, then this sampling is not proper, as the probability for the third block is zero. The algorithm we propose is formalized as Algorithm 1. As in the original algorithm, the step size parameters T, S_i have to be self-adjoint and positive definite operators for the updates to be well-defined. The extrapolation is performed with a scalar θ > 0 and an operator P := diag(p_1 I, ..., p_n I) of the probabilities p_i that an index is selected in each iteration.
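The formal statement of Algorithm 1 is not reproduced in this copy; the following is a minimal NumPy sketch of the serial-sampling special case. The scalar step sizes σ_i = γ/‖A_i‖ and τ = γ min_i p_i/‖A_i‖ are one choice reported in the SPDHG literature, not the only admissible one, and all names are our own. The sketch also keeps z = A*y up to date incrementally, so that per iteration only the sampled A_i and A_i* are evaluated.

```python
import numpy as np

def spdhg_serial(A_blocks, prox_tau_g, prox_fstar_blocks, x0, y0_blocks,
                 probs, n_iter=1000, rng=None):
    """Sketch of stochastic PDHG with serial sampling and theta = 1:
    one dual block i is updated per iteration, drawn i.i.d. with
    probability probs[i]; the extrapolation is scaled by 1/p_i."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(A_blocks)
    norms = [np.linalg.norm(Ai, 2) for Ai in A_blocks]
    gamma = 0.99
    sigma = [gamma / nm for nm in norms]
    tau = gamma * min(probs[i] / norms[i] for i in range(n))
    x = x0.copy()
    y = [yi.copy() for yi in y0_blocks]
    z = sum(Ai.T @ yi for Ai, yi in zip(A_blocks, y))   # z = A^* y, cached
    z_bar = z.copy()
    for _ in range(n_iter):
        x = prox_tau_g(x - tau * z_bar, tau)
        i = rng.choice(n, p=probs)                       # sample one block
        y_new = prox_fstar_blocks[i](y[i] + sigma[i] * (A_blocks[i] @ x), sigma[i])
        dz = A_blocks[i].T @ (y_new - y[i])              # only A_i, A_i^* touched
        y[i] = y_new
        z = z + dz
        z_bar = z + dz / probs[i]                        # extrapolation, theta = 1
    return x, y
```

On a small strongly convex quadratic test problem the iterates settle at the saddle point, since the saddle point is a fixed point of every realization of the update.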
Remark 1. Both the primal and dual iterates x^{k+1} and y^{k+1} are random variables, but only the dual iterate y^{k+1} depends on the sampling S^{k+1}.
Remark 2 (Memory). For memory efficiency, the algorithm can be coded in a different order, where the primal update follows the dual update. Our analysis translates directly to this memory-efficient variant.
Remark 3 (Computation). Due to the sampling, each iteration requires A_i and A_i* to be evaluated only for the selected indices i ∈ S^{k+1}. To see this, note that A*y^k can be stored from the previous iteration (which needs memory of the size of the primal variable x), and its update requires the operators A_i* to be evaluated only for i ∈ S^{k+1}.

4. General Convex Case. We first analyze the convergence of Algorithm 1 in the general convex case, without making use of any strong convexity or smoothness assumptions. In order to analyze the convergence for the large class of samplings described in the previous section, we make use of the expected separable overapproximation (ESO) inequality [39].

Definition 4.1 (Expected Separable Overapproximation (ESO) [39]). Let S be a proper sampling with probabilities p_i. Parameters v_1, ..., v_n > 0 satisfy the ESO inequality for an operator C* if for all y ∈ Y

(4)    E ‖ Σ_{i∈S} C_i* y_i ‖² ≤ Σ_{i=1}^n p_i v_i ‖y_i‖².
Such parameters {v_i} are called ESO parameters.
Remark 4. Note that for any operator C* such parameters always exist, but they are obviously not unique. For the efficiency of the algorithm it is desirable to find ESO parameters such that (4) is as tight as possible; i.e., we want the parameters {v_i} to be small. As we shall see, the ESO parameters influence the choice of the aggregation parameter θ.
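For serial sampling (S = {i} with probability p_i) the expectation over the sampling can be evaluated exactly, and v_i = ‖C_i‖² are always valid ESO parameters, since ‖C_i* y_i‖ ≤ ‖C_i‖ ‖y_i‖ gives Σ_i p_i ‖C_i* y_i‖² ≤ Σ_i p_i v_i ‖y_i‖². A small NumPy sanity check of this (all names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
C = [rng.standard_normal((2, 5)) for _ in range(n)]      # blocks C_i : X -> Y_i
y = [rng.standard_normal(2) for _ in range(n)]
p = np.array([0.2, 0.5, 0.3])                            # serial sampling probabilities

# for serial sampling the expectation over S is a finite sum:
lhs = sum(p[i] * np.linalg.norm(C[i].T @ y[i]) ** 2 for i in range(n))

# v_i = ||C_i||^2 (squared operator norm) are valid ESO parameters
v = [np.linalg.norm(Ci, 2) ** 2 for Ci in C]
rhs = sum(p[i] * v[i] * np.linalg.norm(y[i]) ** 2 for i in range(n))
assert lhs <= rhs
```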
The ESO inequality was first proposed by Richtárik and Takáč [42] to study parallel coordinate descent methods in the context of uniform samplings, which are samplings for which p_i = p_j for all i, j. Improved bounds for ESO parameters were obtained in [24] and used in the context of accelerated coordinate descent. Qu et al. [39] perform an in-depth study of ESO parameters. The ESO inequality is also critical in the study of mini-batch stochastic gradient descent with [28] or without [46] variance reduction.
We will frequently need to estimate the expected value of inner products, which we will do by means of ESO parameters. Recall that we defined weighted norms as ‖x‖²_{T^{−1}} = ⟨T^{−1}x, x⟩.

Lemma 4.2. Let y^k be defined as in Algorithm 1, and let v_i be some ESO parameters of S^{1/2}P^{−1/2}AT^{1/2}, bounded by γ². Denote the expected value with respect to the sampling S^k by E^k. Then for any x ∈ X and c > 0

E^k ⟨x, A*P^{−1}(y^k − y^{k−1})⟩ ≤ (c/2) ‖x‖²_{T^{−1}} + (γ²/(2c)) E^k ‖y^k − y^{k−1}‖²_{(SP)^{−1}}.
For notational convenience we define the function h as in (5).

Theorem 4.3. Let (x⋆, y⋆) ∈ X × Y be any saddle point, let g and f_i be convex, and let the extrapolation parameter be θ = 1. Let the step sizes T, S be chosen such that some ESO parameters as defined by Lemma 4.2 are bounded by γ² with 0 < γ² < 1. Then the iterates of Algorithm 1 satisfy the following assertions.
1. The sequence (x^K, y^K) is bounded in expectation, in the sense of (6), where the constant is given by (7).
2. The Bregman distances between the iterates (x^K, y^K) and a saddle point are almost surely summable, i.e.
In particular, the Bregman distances converge to zero almost surely.

3. The ergodic sequence converges in expectation with rate O(1/K) in the Bregman distance, cf. (8), where the constant is given by (7).
Proof of Theorem 4.3. The result of Lemma A.1 has to be adapted to the stochastic setting, as the next iterate takes the value of ŷ_i^{k+1} only with a certain probability.
First, for the Bregman distance of f* we derive, with the standard result of Lemma A.3, an estimate in which we used the Bregman distance of h* given by (5). Using this result and again Lemma A.3, we can rewrite the estimate of Lemma A.1, where we have used the identity (SP)^{−1} = S^{−1}P^{−1} to simplify the expression. Plugging in the extrapolation and summing (9) over k = 0, ..., K − 1 (note that y^{−1} := y^0) yields an intermediate bound. Moreover, estimating the inner products with Lemma 4.2 and taking expectations with respect to S^1, ..., S^K (denoting this by E) yields (10). All assertions of the theorem follow from inequality (10). It is easy to see that the sequence (x^K, y^K) is bounded in the sense of Item 1, as Bregman distances and norms are non-negative. For Item 2, note that it follows from (10) that the expected sum of the Bregman distances is finite, and thus the Bregman distance between the iterates and any saddle point converges to zero almost surely. To see Item 3, note that Bregman distances of convex functions are convex in their first argument. Thus, dividing (10) by K yields the claim, which was to be shown.

5. Semi-Strongly Convex Case. In this section we propose two algorithms that converge as O(1/K^2) if either f_i* or g is strongly convex.
Remark 7. The convergence of Algorithm 2 with acceleration on the primal variable is similar to the deterministic case, cf. Appendix C.2 of [14], and is omitted here for brevity. It converges with rate O(1/K^2) if there are ESO parameters that are bounded by γ² < 1.
Proof of Theorem 5.1. The updates of the step sizes in Algorithm 3 imply the inequalities (12). To see the latter, note that the inequality is satisfied if (13) holds for all i ∈ {1, ..., n}. With the auxiliary sequence σ̄_k, (13) is satisfied as soon as (14) holds. Note that the transformation from σ̄_k to σ_i^k is well-defined if σ̄_k < min_i p_i/(2(1 − p_i)). Under the additional assumption σ̄_{k+1} = θ_k σ̄_k, (14) is solved with equality by θ_k = (1 + 2σ̄_k)^{−1/2}. Moreover, the sequence of dual step sizes satisfies a uniform bound, which shows that the step size condition of Lemma 4.2 holds if it holds for the initial step size parameters.

For the actual proof of the theorem, note that the inequalities (12) imply (15). Moreover, combining Lemma A.2 and (15) yields a recursive estimate. Using this inequality recursively, with y^{−1} := y^0, we arrive at (16), where the inner product for the second inequality is estimated by Lemma 4.2. The third inequality is due to the non-negativity of norms and of Y^0, which holds by the definition of σ̄_k. Moreover, by the definition of θ_k = σ̄_{k+1}/σ̄_k, (16) can be further simplified. Finally, the assertion follows by Corollary 1 of [13].
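The step size schedule appearing in the proof above is easy to compute. The following sketch (function name ours) iterates θ_k = (1 + 2σ̄_k)^{−1/2} and σ̄_{k+1} = θ_k σ̄_k; the extrapolation factor θ_k increases towards one while σ̄_k decays roughly like 1/k, which is the behavior behind the O(1/K^2) rate.

```python
import numpy as np

def dual_acceleration_schedule(sigma_bar0, n_iter):
    """Step-size schedule from the proof of Theorem 5.1:
    theta_k = (1 + 2*sigma_bar_k)^(-1/2), sigma_bar_{k+1} = theta_k * sigma_bar_k."""
    sigma_bar = sigma_bar0
    thetas, sigmas = [], [sigma_bar]
    for _ in range(n_iter):
        theta = 1.0 / np.sqrt(1.0 + 2.0 * sigma_bar)
        sigma_bar = theta * sigma_bar
        thetas.append(theta)
        sigmas.append(sigma_bar)
    return np.array(thetas), np.array(sigmas)
```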
6. Strongly Convex Case. If both f_i* and g are strongly convex, we may find step size parameters such that Algorithm 1 converges linearly.

Theorem 6.1. Let (x⋆, y⋆) ∈ X × Y be a saddle point and let g, f_i* be strongly convex with constants µ_g, µ_i > 0, i = 1, ..., n. Let the step sizes S, T and 0 < θ < 1 be chosen such that γ² bounds some ESO parameters as defined by Lemma 4.2, γ²θ < 1, and

(17)    θ(I + 2M_g T) ≥ I    and    θ(I + 2M_f S) ≥ I + 2(I − P)M_f S

in a positive semidefinite sense for operators. Let X := T^{−1} + 2M_g and Y := (S^{−1} + 2M_f)P^{−1}. Then the iterates of Algorithm 1 converge linearly to the saddle point, i.e.
Proof. The requirements (17) on the step sizes S, T and θ imply a contraction of the weighted norms. Thus, we directly get (18), where we denoted X = T^{−1} + 2M_g and Y = (S^{−1} + 2M_f)P^{−1}. Combining (18) and Lemma A.2 with constant step sizes yields a per-iteration estimate. Multiplying both sides by θ^{−(k+1)} and summing over k = 0, ..., K − 1 yields the claim, where we again used Lemma 4.2 and the non-negativity of norms for the second inequality. Thus, the assertion is proven.

6.1. Invariant Analysis.
A desirable property of an algorithm is scaling invariance, i.e. the algorithm behaves independently of the scale of all variables and unknowns. If we rewrite problem (2) in terms of the scaled primal variable x̃ := αx and dual variables ỹ_i := β_i y_i, then the corresponding operators Ã_i := A_i/(αβ_i) have norm ‖Ã_i‖ = ‖A_i‖/(αβ_i), the function g̃(x̃) := g(x̃/α) is µ̃_g := µ_g/α²-strongly convex, and the functions f̃_i*(ỹ_i) := f_i*(ỹ_i/β_i) are µ̃_i := µ_i/β_i²-strongly convex. Thus the condition numbers κ_i := ‖A_i‖²/(µ_g µ_i) are scaling invariant.
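The invariance of κ_i is a one-line computation; the following NumPy check makes it concrete (all variable names and the particular scalings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A_i = rng.standard_normal((3, 4))
mu_g, mu_i = 0.5, 2.0
alpha, beta_i = 3.0, 0.25                     # arbitrary rescalings of x and y_i

kappa = np.linalg.norm(A_i, 2) ** 2 / (mu_g * mu_i)

# after rescaling x -> alpha*x, y_i -> beta_i*y_i:
A_s = A_i / (alpha * beta_i)
mu_g_s = mu_g / alpha ** 2
mu_i_s = mu_i / beta_i ** 2
kappa_s = np.linalg.norm(A_s, 2) ** 2 / (mu_g_s * mu_i_s)

assert np.isclose(kappa, kappa_s)             # condition number is scale invariant
```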

6.2. Scalar Parameters for Serial Sampling. This analysis optimizes the convergence rate θ of Theorem 6.1 for three different serial sampling options, where exactly one block is chosen in each iteration. Other sampling strategies, including multi-block, parallel, etc. [39], will be the subject of future work. For ease of exposition we assume that T = τI and S_i = σ_i I, i = 1, ..., n, are all scalar in the following. With σ̄_i := σ_i µ_i and τ̄ := τ µ_g, the conditions (17) on the step sizes become

(19)    θ(1 + 2τ̄) ≥ 1,    θ(1 + 2σ̄_i) ≥ 1 + 2(1 − p_i)σ̄_i,    and    max_i τσ_i‖A_i‖²/p_i ≤ ρ²

for some ρ < 1. The last condition arises from the ESO parameters of serial sampling. Finding optimal parameters is equivalent to equating the above inequalities. Note that the first two conditions (with equality) are equivalent to θτ̄ = (1 − θ)/2 and σ̄_i = (1 − θ)/(2(p_i + θ − 1)). With these choices, the third condition in (19) reads (20), where we denote κ̄_i = κ_i/ρ². It follows from (20) that (21) holds for any i = 1, ..., n.
1. Serial uniform sampling: We first consider uniform sampling, i.e. every block is sampled with the same probability p_i = 1/n. Then it is easy to see that the smallest achievable rate is given in terms of κ̄_max := max_j κ̄_j, and the step sizes follow accordingly.
2. Serial importance sampling: Instead of uniform sampling we may sample "important" blocks more often, i.e. we sample every block with a probability proportional to the square root of its condition number. The smallest rate that achieves (21) is then given in terms of ν := 1 + κ̄_min, κ̄_min := min_j κ̄_j, and the step sizes follow accordingly.
3. Serial optimal sampling: Instead of a predefined probability we seek an "optimal sampling" that minimizes the linear convergence rate θ. The optimal sampling can be found by equating condition (21) for i = 1, ..., n.

7. Numerical Results. All numerical examples are implemented in Python using NumPy and the operator discretization library (ODL) [1]. The Python code and all example data will be made available upon acceptance of this manuscript.
7.1. Non-Strongly Convex PET Reconstruction. In this example we consider positron emission tomography (PET) reconstruction with a total variation (TV) prior. The goal in PET imaging is to reconstruct the distribution of a radioactive tracer from its line integrals [32]. Let X = ℝ^{d_1×d_2} (d_1 = d_2 = 250) be the space of tracer distributions (images) and Y_i = ℝ^{|B_i|} the data spaces, where the B_i ⊂ {1, ..., N}, N = 250 · 354 (250 views around the object), are subsets of indices with B_i ∩ B_j = ∅ if i ≠ j and ∪_{i=1}^n B_i = {1, ..., N}. All samplings in this example divide the views equidistantly. It is standard that PET reconstruction can be posed as an optimization problem whose data fidelity term is given by the Kullback-Leibler divergence (27), with the convention 0 log 0 := 0. The operator A is a scaled X-ray transform where in each of the 250 directions 354 line integrals are computed. The prior is the TV of x with a non-negativity constraint, i.e. g(x) = α‖∇x‖_{1,2} + χ_{≥0}(x), with regularization parameter α = 4, where the gradient operator ∇x ∈ ℝ^{d_1×d_2×2} is discretized by forward differences, cf. [12] for details. The 1,2-norm of these gradients is defined as ‖x‖_{1,2} := Σ_{i=1}^{d_1} Σ_{j=1}^{d_2} √((∇_1 x_{i,j})² + (∇_2 x_{i,j})²). The Fenchel conjugate of the Kullback-Leibler divergence (27) and its proximity operator are available in closed form. The proximal operator for g is approximated with 5 iterations of the fast gradient projection method (FGP) [6] with a warm start, applied to the dual problem.
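The closed form referred to above is elided in this copy. The following is a sketch of the expression commonly used in PDHG-type methods for the prox of the KL conjugate, derived from f*(y) = −Σ_i b_i log(1 − y_i) (up to constants); the function name is ours.

```python
import numpy as np

def prox_kl_conjugate(y, sigma, b):
    """Pointwise prox of sigma * f*, with f*(y) = -sum_i b_i log(1 - y_i)
    the Fenchel conjugate of the KL data term (up to constants).
    Solving the 1D optimality condition u - y + sigma*b/(1-u) = 0 and
    picking the root in the domain u < 1 gives this closed form."""
    return 0.5 * (1.0 + y - np.sqrt((y - 1.0) ** 2 + 4.0 * sigma * b))
```

The returned values always satisfy u_i < 1 for b_i > 0, so the dual iterate stays in the domain of f*.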
Parameters. In this experiment we choose γ = 0.99, θ = 1, and all samplings are uniform, i.e. p_i = 1/n. The number of subsets varies between n = 1 (the deterministic case), 50 and 250. The other step size parameters are chosen individually for PDHG and the algorithm of Pesquet and Repetti.

Results. Some example reconstructed images are shown in Figure 1 and quantitative results in Figures 2 and 3. It can be seen from the reconstructed images after 3 epochs in Figure 1 that both stochastic algorithms are much faster than the deterministic PDHG. The statements of Theorem 4.3 are numerically validated in Figure 2, which shows that the distance to a saddle point is bounded and that the ergodic Bregman distance converges with rate 1/K. All five algorithms are compared quantitatively in Figure 3. In both the objective function value and the dual Bregman distance, the stochastic algorithms are faster, with SPDHG outperforming the algorithm of Pesquet and Repetti. In addition, two other observations can be made. First, SPDHG empirically has less variance than the algorithm of Pesquet and Repetti, as seen from the smoothness of the curves. Second, the stochastic algorithms differ considerably in the convergence speed of the dual Bregman distance.

7.2. TV Denoising with Gaussian Noise (Primal Acceleration). In the second example we consider denoising of an image degraded by Gaussian noise with the help of the anisotropic TV. This can be achieved by solving an optimization problem over X = ℝ^{d_1×d_2} in which the data fit is the squared Euclidean norm and the prior is the (anisotropic) TV. The gradient is again discretized by forward differences, cf. e.g. [12]. Instead of the isotropic TV as in the previous example, we consider here the anisotropic version, as it is separable in the direction of the gradient. Dualizing the anisotropic TV yields a saddle point problem over Y = X², where ı_B is the characteristic function of the unit ball with respect to the ∞-norm.
For the primal-dual algorithms we need the proximal operators of the characteristic function and of the squared 2-norm. These are given by the point-wise projection onto the unit interval and by a shifted scaling, respectively. The regularization parameter is chosen as α = 0.12.

Parameters. In this experiment we choose γ = 1 and the sampling to be uniform, i.e. p_i = 1/n. The number of subsets is either n = 1 in the deterministic case or n = 2 in the stochastic case. The (initial) step size parameters are chosen per algorithm. For the accelerated versions, the step sizes vary with the iteration, the primal step size getting smaller and the dual step size getting larger. The extrapolation factor is chosen as θ = 1 for the non-accelerated algorithms and converges to one for the accelerated versions.
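The two proximal operators mentioned above can be sketched in a couple of lines of NumPy (function names ours; the squared-norm prox is written for the data-fit term (τ/2)‖· − b‖² with noisy image b):

```python
import numpy as np

def prox_linf_ball(y, alpha):
    """Point-wise projection onto the infinity-norm ball of radius alpha,
    i.e. the prox of the characteristic function of that ball (independent
    of the step size)."""
    return np.clip(y, -alpha, alpha)

def prox_sq_l2(x, tau, b):
    """Prox of (tau/2) * ||. - b||^2: a shifted scaling."""
    return (x + tau * b) / (1.0 + tau)
```

Both satisfy their optimality conditions exactly: the projection lands in the ball, and the shifted scaling solves u − x + τ(u − b) = 0.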
Results. By visual assessment of the denoised images in Figure 4, it is easy to conclude that the accelerated algorithms are much faster than the non-accelerated versions. Moreover, it can be seen that the stochastic variant of the accelerated algorithm is even faster than the deterministic version, as the sky is more uniform.
The quantitative results in Figure 5 confirm the visual conclusions. In addition, they show that the accelerated SPDHG indeed converges as O(1/K^2) in the norm of the primal part.

7.3. Huber-TV Deblurring with Unknown Boundary (Dual Acceleration).
In the third example we consider deblurring of an image with a known convolution kernel, but we do not assume knowledge about the boundary of the image [3]. The latter condition is very natural, as no artificial boundary condition (zero, constant, periodic, etc.) will be met in a practical setting. Following the mathematical model of [3], the forward operator is modeled as A : X → ℝ^N. The noise is modeled as Poisson with a constant background of 30, compared to the approximate data mean of 478.96. We further assume the knowledge that the reconstructed image should be non-negative and upper-bounded by 100. We want to solve the corresponding minimization problem. Here the data fidelity f_1 is the Kullback-Leibler divergence (27), where the sum is taken over all pixels. The prior is the anisotropic TV with Huberized norm, with Huber parameter η = 1, regularization parameter α = 0.1 and constraint set C = {x ∈ X : 0 ≤ x_{i,j} ≤ 100}.
By the nature of the forward operator we have that Ax ≥ 0 whenever x ≥ 0. Therefore the solution to the optimization problem remains the same if we replace the Kullback-Leibler divergence by the differentiable modification (30), which has a max_i b_i/r_i²-Lipschitz continuous gradient. The Lipschitz constant is well-defined and non-zero, as both the data b_i and the background r_i are positive.
To obtain the saddle point problem, we dualize the data term and the Huberized TV, which yields a saddle point problem over X and Y. The convex conjugate of the modified Kullback-Leibler divergence (30) can readily be seen to be min_i r_i²/b_i-strongly convex. Moreover, for the algorithm we also need its proximal operator. Note that (αξ_η)*(y_i) = ı_{αB}(y_i) + (η/(2α))‖y_i‖²_2, where ı_{αB} is the indicator function of the ∞-ball with radius α around 0, cf. (29).
The norms of the directional derivative operators are bounded by 2, and the norm of the blurring operator was estimated as 10.45 by the power method.
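The operator-norm estimate can be reproduced with a few lines of NumPy. This is a generic power-method sketch on AᵀA (function name ours), not the paper's code:

```python
import numpy as np

def power_method_norm(A, n_iter=100, rng=None):
    """Estimate the operator norm ||A|| = largest singular value of A
    via the power method applied to A^T A."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.standard_normal(A.shape[1])
    for _ in range(n_iter):
        x = A.T @ (A @ x)                 # one power iteration on A^T A
        x /= np.linalg.norm(x)
    # Rayleigh quotient of A^T A at the converged vector gives sigma_max^2
    return np.sqrt(x @ (A.T @ (A @ x)))
```

For a dense matrix the result matches the spectral norm computed by `np.linalg.norm(A, 2)`; for a matrix-free operator (such as the blurring operator here) only products with A and Aᵀ are needed.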
Parameters. In this experiment we choose γ = 1 and consider both uniform (p_i = 1/n) and importance sampling (p_i = ‖A_i‖ / Σ_j ‖A_j‖); the latter has probabilities proportional to the norms of the operators. The number of subsets is either n = 1 in the deterministic case or n = 3 in the stochastic case. The (initial) step size parameters are chosen per algorithm as before.

Results. Figure 6 shows the data and the reconstructed images from PDHG, DA-PDHG and DA-SPDHG with importance sampling after 100 epochs. It is easy to see that, while PDHG has not yet restored the contrast and both deterministic variants are still relatively blurry, DA-SPDHG with importance sampling yields a sharp reconstruction. The quantitative results in Figure 7 confirm the visual assessment. The combination of importance sampling and acceleration yields a significant speed-up in both quality measures.

7.4. PET Reconstruction (Linear Rate). For the final example we turn back to PET reconstruction, but this time with a linear rate. This means we want to solve the same minimization problem as in the first example, but now we replace the Kullback-Leibler functional by its modified version as in the previous example. We note again that this does not change the solution of the minimization problem. Moreover, to make the TV strongly convex we add another regularization term (µ/2)‖x‖²_2. Note that the proximal operator of the TV (indeed of any functional) with an added squared 2-norm, i.e. prox_{τ(g + (µ/2)‖·‖²_2)}(x) = prox_{τ̃g}(x/(1 + τµ)) with τ̃ := τ/(1 + τµ), is easily computed from the proximal operator of the functional itself. We will consider two different settings. First, all views of the PET forward operator are equally divided, making it unnecessary to consider samplings other than uniform. Second, to test the impact of sampling, we choose one subset to contain half the views and divide the remaining views uniformly among the remaining subsets. This imbalanced setting should make it crucial to consider non-uniform sampling strategies.

Parameters. In this experiment we choose ρ = 0.99 and the sampling to be either uniform or optimal. The other step size parameters are chosen as derived in subsection 6.2.
Results. The visual results in Figure 8 show that SPDHG is much faster than the deterministic PDHG and that the speed increases with the number of subsets. This is confirmed quantitatively in Figure 9 in terms of the distance to the saddle point and the objective function value. Moreover, the impact of optimal sampling is apparent in Figure 10, which shows the performance of SPDHG in the imbalanced setting.
8. Conclusions and Future Work. We proposed a natural stochastic generalization of the deterministic PDHG algorithm for convex-concave saddle point problems that are separable in the dual variable. The analysis was carried out in the context of arbitrary samplings, which enabled us to obtain known deterministic convergence results as special cases. We proposed optimal choices of the step size parameters, with which the proposed algorithm showed superior empirical performance on a variety of optimization problems in imaging.
In the future, we would like to extend the analysis to include iteration-dependent (adaptive) probabilities [17] and strong convexity parameters, to further exploit the structure of many relevant problems. Moreover, the present optimal sampling strategies are only for scalar-valued step sizes and serial sampling. We wish to extend these to other sampling strategies such as multi-block or parallel sampling.
Appendix A. Auxiliary Results for the Proofs.
Proof of Lemma 4.2. Denote by ŷ_i^k the value of y_i^k if it got updated (and thus it is not random with respect to S^k). Furthermore, as y_i^k = y_i^{k−1} for i ∉ S^k and y_i^k = ŷ_i^k if i ∈ S^k, we have, by completing the norm, for any x ∈ X the identity (31). Moreover, we now estimate the expectation of the second term of the right-hand side of (31), where the first inequality is due to the ESO inequality (4), the second inequality holds by the definition of γ, and the last equation holds due to Corollary A.4. Combining the expected value of inequality (31) with inequality (32) yields the assertion. Here we used the definition of the inner product and the norm on the product space Y.
It now suffices to complete the Bregman distances D_g(x^{k+1}, x⋆) = g(x^{k+1}) − g(x⋆) − ⟨−A*y⋆, x^{k+1} − x⋆⟩ and D_{f*}(ŷ^{k+1}, y⋆).

Lemma A.2 (Stochastic inequality). Let x^{k+1}, ŷ^{k+1} be defined as in Lemma A.1, let y^{k+1}, ȳ^{k+1} be as in Algorithms 2 and 3, and let (x⋆, y⋆) be a saddle point. Moreover, let γ² bound some ESO parameters of (S^k)^{1/2}P^{−1/2}A(T^k)^{1/2}. Then the following estimate holds.

Proof. We adapt the result of Lemma A.1 to the stochastic setting suitable for strongly convex functionals. In addition, all step size parameters may vary along the iterations. The claim follows by Lemma A.1 and the strong convexity of f* and g.

Figure 1. PET reconstruction with TV prior solved as a non-strongly convex problem. Results after 3 epochs with uniform sampling of 250 subsets. From left to right: approximate primal part of the saddle point, PDHG, Pesquet and Repetti [35], and SPDHG. With the same number of operator evaluations, both stochastic algorithms make much more progress towards the saddle point.

Figure 2. Numerical evaluations (solid lines) of the results in Theorem 4.3. Both the distance to a saddle point (6) (left) and the ergodic Bregman distance (8) (right) behave as predicted by the analysis (dashed lines); however, the bounds are not sharp for this example.

Figure 3. Comparison of PDHG, SPDHG and the algorithm of Pesquet and Repetti [35] in terms of the relative objective (Φ(x^K) − Φ⋆)/(Φ(x^0) − Φ⋆) (left) and the dual Bregman distance D_{f*}(y^K, y⋆) (right). Both graphs agree that the proposed algorithm converges faster than the algorithm of Pesquet and Repetti. Moreover, the graph of the relative objective function values (left) indicates that the proposed algorithm also has a much smaller variance compared to the algorithm of Pesquet and Repetti.

Figure 6. Deblurring with total variation regularization with Poisson noise and unknown boundaries. Results after 100 epochs. Top, from left to right: blurry and noisy data with kernel (magnified), PDHG, DA-PDHG and DA-SPDHG (importance sampling, 3 subsets). Bottom: close-ups of the top row.

Figure 7. Quantitative results for deblurring. In both quality measures (left: distance to the dual part of the saddle point, (1/2)‖y^K − y⋆‖²; right: peak signal-to-noise ratio (PSNR)) the accelerated algorithms are faster than their non-accelerated counterparts. While uniform sampling does not speed up the convergence for this example, the combination of importance sampling with acceleration yields by far the fastest algorithm.

Figure 8. PET reconstruction with a strongly convex TV prior and uniform sampling. Results after 5 epochs. Left: approximate primal part of the saddle point. Right, from left to right: PDHG, SPDHG (10 subsets) and SPDHG (250 subsets). It is clear that with more subsets the reconstruction becomes closer to the desired solution.

Figure 9. Comparison of PDHG and SPDHG with uniform sampling for PET reconstruction. As can be seen in terms of the distance to the saddle point (left) and the relative objective function value (right), the stochastic variants are all much faster than the deterministic PDHG. Moreover, the graph on the left numerically verifies the result of Theorem 6.1, as the empirical solid curves all lie below their provable worst-case upper bound (dashed).

Figure 10. Comparison of different samplings for SPDHG with imbalanced subsets. As can be seen from either graph, non-uniform subset selection requires non-uniform sampling for improved convergence speed.