Tensor Methods for Minimizing Convex Functions with H\"{o}lder Continuous Higher-Order Derivatives

In this paper we study $p$-order methods for unconstrained minimization of convex functions that are $p$-times differentiable ($p\geq 2$) with $\nu$-H\"{o}lder continuous $p$th derivatives. We propose tensor schemes with and without acceleration. For the schemes without acceleration, we establish iteration complexity bounds of $\mathcal{O}\left(\epsilon^{-1/(p+\nu-1)}\right)$ for reducing the functional residual below a given $\epsilon\in (0,1)$. Assuming that $\nu$ is known, we obtain an improved complexity bound of $\mathcal{O}\left(\epsilon^{-1/(p+\nu)}\right)$ for the corresponding accelerated scheme. For the case in which $\nu$ is unknown, we present a universal accelerated tensor scheme with iteration complexity of $\mathcal{O}\left(\epsilon^{-p/[(p+1)(p+\nu-1)]}\right)$. A lower complexity bound of $\mathcal{O}\left(\epsilon^{-2/[3(p+\nu)-2]}\right)$ is also obtained for this problem class.


1. Introduction.
1.1. Motivation. In [13], it was shown that a suitable cubic regularization of the Newton method (CNM) takes at most $\mathcal{O}(\epsilon^{-1/2})$ iterations to reduce the functional residual below a given precision $\epsilon > 0$ when the objective is a twice-differentiable convex function with a Lipschitz continuous Hessian. A better complexity bound of $\mathcal{O}(\epsilon^{-1/3})$ was shown in [14] for an accelerated version of CNM. The auxiliary problems in these methods consist in the minimization of a third-order regularization of the second-order Taylor approximation of the objective function around the current iterate. A natural generalization is to consider auxiliary problems in which one minimizes a $(p+1)$-order regularization of the $p$th-order Taylor approximation of the objective function, resulting in tensor methods. Unconstrained optimization by tensor methods is not a new subject (see, e.g., [17, 5]). In the context of convex optimization, accelerated tensor methods (as described above) were first considered by Baes [2]. However, that work did not exploit the fact that, under a proper choice of the regularization coefficient, the auxiliary problems become convex. This important observation was made in a recent paper [15], where tensor methods with and without acceleration were proposed for unconstrained minimization of $p$-times differentiable convex functions with Lipschitz continuous $p$th derivatives. An iteration complexity bound of $\mathcal{O}(\epsilon^{-1/p})$ was proved for the method without acceleration, while an improved bound of $\mathcal{O}(\epsilon^{-1/(p+1)})$ was proved for the accelerated tensor method.
In the present paper, we study tensor methods (with and without acceleration) that can handle convex functions with $\nu$-Hölder continuous $p$th derivatives ($p \geq 2$) and allow the inexact solution of the auxiliary problems (in the sense of [4]). Specifically, our contribution is threefold:
1. For the schemes without acceleration, we establish iteration complexity bounds of $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ for reducing the functional residual below a given $\epsilon \in (0,1)$.
2. Assuming that $\nu$ is known, we obtain an improved complexity bound of $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ for the corresponding accelerated scheme.
3. For the case in which $\nu$ is unknown, we present a universal accelerated tensor scheme with iteration complexity of $\mathcal{O}(\epsilon^{-p/[(p+1)(p+\nu-1)]})$. A lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ is also obtained, from which we conclude that our accelerated nonuniversal scheme is nearly optimal.
The methods and results presented here extend in a significant way the contributions in [2, 8, 9, 15]. Indeed, [8, 9] deal only with second-order schemes ($p = 2$) that require the exact solution of the auxiliary problems. On the other hand, the $p$-order methods proposed in [2, 15] are restricted to the Lipschitz case ($\nu = 1$), assuming that the Lipschitz constant is known and that the auxiliary problems are solved exactly. We believe that the development of $p$-order methods with affordable trial steps and automatic adjustment to the smoothness class of the objective (universality) constitutes a fundamental step toward implementable high-order methods for convex optimization.
1.2. Contents. The paper is organized as follows. In Section 2, we define our problem. In Section 3, we present tensor methods without acceleration and establish their convergence properties. In Section 4, we present complexity results for accelerated schemes. Finally, in Section 5 we obtain lower complexity bounds for tensor methods under the Hölder condition. All necessary auxiliary results are included in Appendix A.

1.3. Notations and Generalities.
In what follows, we denote by $\mathbb{E}$ a finite-dimensional real vector space and by $\mathbb{E}^*$ its dual space, composed of linear functionals on $\mathbb{E}$. The value of functional $s \in \mathbb{E}^*$ at point $x \in \mathbb{E}$ is denoted by $\langle s, x \rangle$. Given a self-adjoint positive definite operator $B : \mathbb{E} \to \mathbb{E}^*$ (notation $B \succ 0$), we can endow these spaces with conjugate Euclidean norms:
$$\|x\| = \langle Bx, x \rangle^{1/2}, \quad x \in \mathbb{E}, \qquad \|s\|_* = \langle s, B^{-1}s \rangle^{1/2}, \quad s \in \mathbb{E}^*.$$
For a smooth function $f : \operatorname{dom} f \to \mathbb{R}$ with convex and open domain $\operatorname{dom} f \subset \mathbb{E}$, denote by $\nabla f(x)$ its gradient and by $\nabla^2 f(x)$ its Hessian evaluated at point $x \in \operatorname{dom} f$. For any integer $p \geq 1$, denote by $D^p f(x)[h_1, \dots, h_p]$ the directional derivative of function $f$ at $x$ along directions $h_i \in \mathbb{E}$, $i = 1, \dots, p$. In particular, for any $x \in \operatorname{dom} f$ and $h_1, h_2 \in \mathbb{E}$ we have
$$Df(x)[h_1] = \langle \nabla f(x), h_1 \rangle, \qquad D^2 f(x)[h_1, h_2] = \langle \nabla^2 f(x) h_1, h_2 \rangle.$$
Then the $p$th-order Taylor approximation of function $f$ at $x \in \operatorname{dom} f$ can be written as follows:
$$\Phi_{x,p}(y) = f(x) + \sum_{i=1}^{p} \frac{1}{i!} D^i f(x)[y - x]^i, \quad y \in \mathbb{E}.$$
Note that $D^p f(x)[\cdot]$ is a symmetric $p$-linear form. Its norm is defined by
$$\|D^p f(x)\| = \max_{h_1, \dots, h_p} \left\{ |D^p f(x)[h_1, \dots, h_p]| \ : \ \|h_i\| \leq 1, \ i = 1, \dots, p \right\}.$$
In fact, it can be shown (see, e.g., [3]) that
$$\|D^p f(x)\| = \max_{h} \left\{ |D^p f(x)[h]^p| \ : \ \|h\| \leq 1 \right\}.$$
Since $D^p f(x) - D^p f(y)$ is also a symmetric $p$-linear form for fixed $x, y \in \operatorname{dom} f$, we can define
$$\|D^p f(x) - D^p f(y)\| = \max_{h} \left\{ |D^p f(x)[h]^p - D^p f(y)[h]^p| \ : \ \|h\| \leq 1 \right\}.$$

2. Problem Statement. In this paper we consider methods for solving the minimization problem
$$(2.1) \qquad \min_{x \in \mathbb{E}} f(x),$$
where $f : \mathbb{E} \to \mathbb{R}$ is a convex $p$-times differentiable function ($p \geq 2$). We assume that there exists at least one optimal solution $x^* \in \mathbb{E}$ of problem (2.1). Let us characterize the level of smoothness of the objective $f$ by the system of Hölder constants
$$(2.2) \qquad H_{f,p}(\nu) = \sup_{x, y \in \mathbb{E},\ x \neq y} \frac{\|D^p f(x) - D^p f(y)\|}{\|x - y\|^{\nu}}, \quad \nu \in [0, 1].$$
Then, from (2.2) and from the integral form of the mean-value theorem, it follows that
$$(2.5) \qquad |f(y) - \Phi_{x,p}(y)| \leq \frac{H_{f,p}(\nu)}{p!} \|y - x\|^{p+\nu}, \quad x, y \in \mathbb{E}.$$
This property motivates the use of the following class of models of $f$ around $x \in \mathbb{E}$:
$$\Omega^{(\nu)}_{x,p,H}(y) = \Phi_{x,p}(y) + \frac{H}{p!} \|y - x\|^{p+\nu}, \quad y \in \mathbb{E}.$$
In particular, as long as $H \geq H_{f,p}(\nu)$, by (2.5) we have $f(y) \leq \Omega^{(\nu)}_{x,p,H}(y)$ for all $y \in \mathbb{E}$.
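To make these models concrete, here is a minimal numerical sketch (ours, not code from the paper) of the regularized Taylor model for $p = 2$: it builds $\Omega^{(\nu)}_{x,2,H}$ for a toy convex function and checks the overestimation property at random points. The test function and all names are our own choices.

```python
import numpy as np

def taylor2(f, grad, hess, x, y):
    """Second-order Taylor approximation Phi_{x,2}(y) of f around x."""
    d = y - x
    return f(x) + grad(x) @ d + 0.5 * d @ hess(x) @ d

def omega(f, grad, hess, x, y, H, nu):
    """Model Omega^{(nu)}_{x,2,H}(y) = Phi_{x,2}(y) + (H/2!) * ||y - x||^(2+nu)."""
    d = y - x
    return taylor2(f, grad, hess, x, y) + (H / 2.0) * np.linalg.norm(d) ** (2.0 + nu)

# Toy check of the overestimation property f(y) <= Omega (cf. (2.5)):
f    = lambda x: 0.25 * np.sum(x ** 4)      # convex quartic test function (our choice)
grad = lambda x: x ** 3
hess = lambda x: np.diag(3.0 * x ** 2)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(3)
assert f(y) <= omega(f, grad, hess, x, y, H=50.0, nu=1.0)  # holds once H is large enough
```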
3. Tensor schemes without acceleration. If we assume that $H_{f,p}(\nu) < +\infty$ for some $\nu \in [0,1]$, there are two possible situations: either $\nu$ is known, or $\nu$ is unknown. We cover both cases in a single framework by introducing the parameter
$$(3.1) \qquad \alpha = \begin{cases} \nu, & \text{if } \nu \text{ is known}, \\ 1, & \text{if } \nu \text{ is unknown}, \end{cases}$$
and the corresponding models $\Omega^{(\alpha)}_{x,p,H}(y) = \Phi_{x,p}(y) + \frac{H}{p!}\|y - x\|^{p+\alpha}$. Let $\epsilon \in (0,1)$ be the target precision. At the beginning of the $t$th iteration one has an estimate $x_t$ of the solution of (2.1) and a scaling coefficient $M_t$. A trial point $x^+_t$ is computed as an approximate solution to the auxiliary problem
$$(3.2) \qquad \min_{y \in \mathbb{E}} \ \Omega^{(\alpha)}_{x_t,p,M_t}(y),$$
with $\alpha$ given by (3.1). Similarly to [4], the trial point $x^+_t$ must satisfy the condition
$$(3.3) \qquad \|\nabla \Omega^{(\alpha)}_{x_t,p,M_t}(x^+_t)\|_* \leq \theta \|x^+_t - x_t\|^{p+\alpha-1},$$
where $\theta \geq 0$ is a user-defined parameter. When the objective of (3.2) is not convex, $x^+_t$ is not necessarily an approximation of its global minimizer. If the descent condition (3.4) holds, then $x^+_t$ is accepted and we define $x_{t+1} = x^+_t$. Otherwise, the constant $M_t$ is increased until the corresponding trial point $x^+_t$ is accepted. We will see that this process is well defined in the sense that there exists $M_\nu > 0$ such that $M_t \leq M_\nu$ for all $t$. This general scheme can be summarized in the following way.
Algorithm 1. Tensor scheme without acceleration.
Step 0. Choose $x_0 \in \mathbb{E}$, $M_0 > 0$, and $\theta \geq 0$. Set $t := 0$.
Step 1. Compute an approximate solution $x^+_t$ of (3.2) satisfying (3.3). If the descent condition holds, go to Step 2; otherwise, increase $M_t$ and repeat Step 1.
Step 2. Set $x_{t+1} = x^+_t$.
Step 3. Set $t := t + 1$ and go back to Step 1.
Remark 3.1. Regarding the approximate solution of the auxiliary problems, it is easy to see that a point $x^+_t$ satisfying (3.3) can be computed by any monotone optimization scheme that drives the gradient of the objective to zero. In [10], we investigated the use of gradient methods with Bregman distance for the case $p = 3$ and $\alpha = \nu$. Specifically, under assumptions H1 and H2 below, we showed that if $\{y_k\}_{k=0}^{T}$ is a sequence generated by these methods applied to $\min_{y \in \mathbb{E}} \Omega^{(\nu)}_{x,3,H}(y)$, then the gradients of the model along this sequence vanish at a rate that guarantees (3.3) after a bounded number of inner iterations. For more details, see Theorems 3.8 and 3.10 in [10].
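As a minimal illustration of this remark (our own sketch, not the Bregman-distance scheme of [10]), one can run plain gradient descent on the model until the inexactness condition (3.3) is met. The model gradient, step size, and stopping parameters below are illustrative assumptions:

```python
import numpy as np

def solve_auxiliary(omega_grad, x, theta, p=2, alpha=1.0, step=1e-2, max_iter=10_000):
    """Gradient descent on the model, stopped as soon as the inexactness
    condition ||grad Omega(y)||_* <= theta * ||y - x||^(p+alpha-1) holds."""
    y = x.copy()
    for _ in range(max_iter):
        g = omega_grad(y)
        if np.linalg.norm(g) <= theta * np.linalg.norm(y - x) ** (p + alpha - 1):
            break
        y = y - step * g
    return y

# Illustration: cubic-regularized model of f(y) = 0.5*||y||^2 around x0,
# i.e., Omega(y) = 0.5*||y||^2 + (H/3)*||y - x0||^3 (all constants ours).
x0, H = np.array([3.0, -1.0]), 2.0
omega_grad = lambda y: y + H * np.linalg.norm(y - x0) * (y - x0)
y_plus = solve_auxiliary(omega_grad, x0, theta=0.1)
```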
To analyze the convergence of Algorithm 1, we introduce the following assumptions:
H1. $H_{f,p}(\nu) < +\infty$ for some $\nu \in [0, 1]$.
H2. The level set $\mathcal{L}(x_0) = \{x \in \mathbb{E} : f(x) \leq f(x_0)\}$ is bounded, so that $\|x - x^*\| \leq D$ for all $x \in \mathcal{L}(x_0)$ and some $D > 0$.

Theorem 3.2. Suppose that H1 and H2 are true, and let $\{x_t\}_{t=0}^{T}$ be a sequence generated by Algorithm 1. Denote by $m$ the first iteration number for which the scaled functional residual $\delta_m$ (defined in the proof below) satisfies $\delta_m \leq 2$, and assume that $m < T$. Then $m$ obeys the upper bound (3.5) and, for all $k$ with $m < k \leq T$, the rate of convergence (3.6) holds.

Proof. By Step 1 in Algorithm 1, we have (3.7). Thus, in view of (3.4), (3.7), and H2, for $k = 0, \dots, T-1$ we obtain (3.9), where the last inequality is due to the convexity of $f$. Now, denoting by $\delta_k$ a suitably scaled functional residual, we see from (3.9) that this sequence satisfies condition (1.1) of Lemma 1.1 in [8] with $u = \frac{p+\alpha}{p+\alpha-1}$. Note that $m$ is the first iteration for which $\delta_m \leq 2$. Lemma 1.1 in [8] then allows us to obtain the upper bound (3.5) for $m$ and also simplifies our final bound for the functional residual. Indeed, if $m > 0$, then $\delta_0 > 2$ and, in view of inequality (1.2) of Lemma 1.1 in [8], bound (3.5) holds. Consequently, from inequality (1.3) of Lemma 1.1 in [8] we get the rate of convergence (3.6).

If we assume that $\nu$ and $H_{f,p}(\nu)$ are known, then by Lemma A.2 we can set $M_t$ equal to a suitable fixed multiple of $H_{f,p}(\nu)$ for all $t$. Here, by (3.1), we have $\alpha = \nu$, and the corresponding version of Algorithm 1 takes at most $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ iterations to reduce the functional residual below $\epsilon$. Often, however, $H_{f,p}(\nu)$ is not known. To deal with this situation, we can consider the following adaptive version of Algorithm 1.
Algorithm 2. Adaptive tensor scheme.
Step 0. Choose $x_0 \in \mathbb{E}$, $H_0 > 0$, and $\theta \geq 0$. Set $t := 0$.
Step 1. Set $i := 0$.
Step 1.1. Compute an approximate solution $x^+_{t,i}$ of $\min_{x \in \mathbb{E}} \Omega^{(\alpha)}_{x_t,p,2^i H_t}(x)$ such that
$$\|\nabla \Omega^{(\alpha)}_{x_t,p,2^i H_t}(x^+_{t,i})\|_* \leq \theta \|x^+_{t,i} - x_t\|^{p+\alpha-1}.$$
Step 1.2. If the descent condition holds, set $i_t := i$ and go to Step 2. Otherwise, set $i := i + 1$ and go back to Step 1.1.
Step 2. Set $x_{t+1} = x^+_{t,i_t}$ and $H_{t+1} = 2^{i_t - 1} H_t$.
Step 3. Set $t := t + 1$ and go back to Step 1.
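In code, the inner loop of Step 1 can be sketched as follows (ours; the callable `accept` stands in for the descent condition, whose exact form is given in the paper's displayed equations, and `model_solver` abstracts the inexact minimization of the model):

```python
def adaptive_tensor_step(model_solver, accept, x, H, max_doublings=60):
    """One iteration of the adaptive scheme: double the scaling coefficient
    2^i * H until the trial point passes the acceptance (descent) test."""
    for i in range(max_doublings):
        x_plus = model_solver(x, (2 ** i) * H)    # approximate solution of (3.2), cf. (3.3)
        if accept(x, x_plus):                     # stands in for the descent condition
            return x_plus, (2 ** i) * H / 2.0, i  # x_{t+1}, H_{t+1}, i_t
    raise RuntimeError("acceptance test never satisfied")
```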
Let us now define a certain function $H(\epsilon)$ of the target accuracy $\epsilon > 0$, given explicitly in (3.13), which will serve as an upper bound for the scaling coefficients. The next lemma provides upper bounds on $H_t$ and on the number of calls of the oracle.

Lemma 3.3. Suppose that H1 and H2 are true. Given $\epsilon > 0$, assume that $\{x_t\}_{t=0}^{T}$ is a sequence generated by Algorithm 2 such that $f(x_t) - f(x^*) > \epsilon$ for $t = 0, \dots, T$. Then the upper bound (3.14) on the scaling coefficients $H_t$ holds. Moreover, the number $\mathcal{O}_T$ of calls of the oracle after $T$ iterations is bounded as in (3.15).

Proof. Let us prove (3.14) by induction. Clearly it holds for $t = 0$. Assume that (3.14) is true for some $t$, $0 \leq t \leq T-1$. If $\nu$ is known, then by (3.1) we have $\alpha = \nu$. Thus, by H1 and Lemma A.2, the final value of $2^{i_t} H_t$ cannot exceed twice the threshold in (3.14), since otherwise the line search would have stopped earlier. Therefore (3.14) also holds for $t + 1$.
On the other hand, if $\nu$ is unknown, we have $\alpha = 1$. In view of (3.11), (3.12), and H2, the functional residuals at the trial points remain under control and, by (3.13) and Lemma A.5 in [9], so do the gradients $\|\nabla f(x^+_{t,i})\|_*$. In this case, it follows from the corresponding corollary in Appendix A that the final value of $2^{i_t} H_t$ again cannot exceed twice the threshold in (3.14).
Consequently, we also have that (3.14) holds for $t + 1$. This completes the induction argument. Finally, note that at the $k$th iteration of Algorithm 2 the oracle is called $i_k + 1$ times, which yields (3.15).
From Lemma 3.3, we see that Algorithm 2 is a particular case of Algorithm 1 in which the scaling coefficients $M_t = 2^{i_t} H_t$ remain bounded for all $t$. Thus, combining Theorem 3.2 and Lemma 3.3, we obtain the following result.
Theorem 3.4. Suppose that H1 and H2 are true. Given $\epsilon \in (0, 1)$, assume that $\{x_t\}_{t=0}^{T}$ is a sequence generated by Algorithm 2 such that $f(x_t) - f(x^*) > \epsilon$ for $t = 0, \dots, T$. Denote by $m$ the first iteration number for which $\delta_m \leq 2$ (as in Theorem 3.2), and assume that $m < T$. Then the bounds (3.19) and (3.20) hold. Consequently, whether $\nu$ is known or unknown, the scheme needs at most $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ iterations to reduce the functional residual below $\epsilon$, as stated in (3.21).

Proof. As mentioned above, by Lemma 3.3 the scaling coefficients $M_t = 2^{i_t} H_t$ satisfy (3.14) for all $t$. Then (3.19) and (3.20) follow directly from Theorem 3.2, which gives (3.22). If $\nu$ is known, then $\alpha = \nu$ and, by (3.10), we obtain (3.23). Thus, combining (3.22) and (3.23), we get (3.21). On the other hand, if $\nu$ is unknown, then $\alpha = 1$ and, by (3.10), (3.16), and $\epsilon \in (0, 1)$, we obtain (3.24). In this case, combining (3.22) and (3.24), we also get (3.21).
Remark 3.5. Note that for any $\theta$ in a suitable interval $[0, \bar{\theta}]$, the corresponding right-hand side in (3.21) has the same value as for $\theta = 0$.
Note that Algorithm 2 with $\alpha = 1$ is a universal scheme: it works for any Hölder parameter $\nu \in [0,1]$ without using it explicitly. This algorithm can be viewed as a generalization of the universal method (6.10) in [8]. Looking at the efficiency bound (3.21) for $\nu$ known and $\nu$ unknown, we see that the universal scheme ensures the same dependence on the accuracy $\epsilon$ as the nonuniversal scheme ($\alpha = \nu$). Remarkably, this is not true for the accelerated schemes obtained from the standard estimating-sequences technique, as we will see in the next section.
4. Accelerated tensor schemes. Similarly to Section 3, we consider a general accelerated tensor method parametrized by the constant $\alpha$ given in (3.1). Specifically, at the beginning of the $t$th iteration ($t > 0$) one has an estimate $x_t$ of the solution of (2.1), an auxiliary vector $v_t$, and constants $A_t, M_t > 0$. A new vector $y_t$ is computed as a convex combination of $x_t$ and $v_t$,
$$y_t = \frac{A_t x_t + a_t v_t}{A_t + a_t},$$
with $a_t > 0$ being computed from equation (4.3). Then, a trial point $x^+_t$ is computed as an approximate solution to the auxiliary problem $\min_{x \in \mathbb{E}} \Omega^{(\alpha)}_{y_t,p,M_t}(x)$ such that
$$\|\nabla \Omega^{(\alpha)}_{y_t,p,M_t}(x^+_t)\|_* \leq \theta \|x^+_t - y_t\|^{p+\alpha-1},$$
where $\theta \geq 0$ is a user-defined parameter. If the descent condition is satisfied, then $x^+_t$ is accepted and we define $x_{t+1} = x^+_t$. Otherwise, the constant $M_t$ is increased until the corresponding trial point $x^+_t$ is accepted. As in Algorithm 1, we assume that there exists $M_\nu > 0$ such that $M_t \leq M_\nu$ for all $t$. After obtaining $x_{t+1}$, we set $A_{t+1} = A_t + a_t$ and compute
$$(4.8) \qquad \psi_{t+1}(x) = \psi_t(x) + a_t\left[f(x_{t+1}) + \langle \nabla f(x_{t+1}), x - x_{t+1}\rangle\right], \qquad v_{t+1} = \arg\min_{x \in \mathbb{E}} \psi_{t+1}(x).$$
To initialize, we choose $x_0 \in \mathbb{E}$ and set $v_0 = x_0$, $A_0 = 0$, and $\psi_0(x) = \frac{1}{p+\alpha}\|x - x_0\|^{p+\alpha}$. This general scheme can be summarized in the following way.
Algorithm 3. Accelerated tensor scheme.
Step 0. Choose $x_0 \in \mathbb{E}$, $M_0 > 0$, and $\theta \geq 0$. Set $v_0 = x_0$, $A_0 = 0$, $\psi_0(x) = \frac{1}{p+\alpha}\|x - x_0\|^{p+\alpha}$, and $t := 0$.
Step 1. Compute $y_t$ by the convex combination above and a trial point $x^+_t$ satisfying the inexactness and descent conditions, increasing $M_t$ if necessary.
Step 2. Set $x_{t+1} = x^+_t$ and $A_{t+1} = A_t + a_t$, with $a_t > 0$ obtained from (4.3).
Step 3. Update $\psi_{t+1}$ by (4.8) and compute $v_{t+1} = \arg\min_{x \in \mathbb{E}} \psi_{t+1}(x)$.
Step 4. Set t := t + 1 and go back to Step 1.
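For intuition, the estimating-sequence mechanics of Algorithm 3 can be sketched as follows (our simplification: the explicit rule for $a_t$ below replaces the root-finding in (4.3), and `tensor_step` abstracts the inexact model minimization):

```python
import numpy as np

def accelerated_tensor(grad, tensor_step, x0, M, p=2, alpha=1.0, T=50):
    """Schematic estimating-sequence loop. psi_t(x) = (1/(p+alpha))||x-x0||^(p+alpha)
    plus accumulated linear terms; v_t = argmin psi_t has a closed form."""
    x, v, A = x0.copy(), x0.copy(), 0.0
    lin = np.zeros_like(x0)                       # sum of a_i * grad f(x_{i+1})
    for t in range(T):
        a = (t + 1) ** (p + alpha - 1) / M        # simplified rule in place of (4.3)
        y = x if A == 0.0 else (A * x + a * v) / (A + a)
        x = tensor_step(y)                        # approximate model minimizer at y
        A += a
        lin += a * grad(x)                        # update of psi, cf. (4.8)
        nrm = np.linalg.norm(lin)
        v = x0 if nrm == 0.0 else x0 - lin * nrm ** (1.0 / (p + alpha - 1.0) - 1.0)
    return x
```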
Regarding the computation of $a_t$: for $t = 0$, equation (4.3) admits an explicit solution, since $A_0 = 0$. For $t > 0$ we have $A_t > 0$, and the computation of $a_t$ requires the solution of a univariate nonlinear equation of the form $g(a) = 0$. Since $g$ is continuous, we can use the bisection method to compute an approximation of a root $a^* \in (0, c]$ of this equation. As can be seen in the proof of Theorem 4.3, our convergence results only require the weaker condition (4.6) on this approximation.
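A generic bisection sketch is given below (ours); the particular shape of $g$ and all constants in the example are illustrative assumptions rather than the paper's exact equation (4.3):

```python
def bisect(g, lo, hi, tol=1e-12, max_iter=200):
    """Bisection for a continuous g with g(lo) < 0 <= g(hi); returns a point a
    close to the root with g(a) >= 0, in the spirit of requirement (4.6)."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol:
            break
    return hi   # invariant: g(hi) >= 0 throughout

# Example with an equation of the kind arising here (illustrative constants):
p, alpha, A_t = 3, 1.0, 2.0
g = lambda a: a ** (p + alpha) / (A_t + a) ** (p + alpha - 1) - 0.1
a_t = bisect(g, 1e-16, 3.0)
```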
The next result establishes the relationship between the estimating functions $\psi_t(\cdot)$ and the objective function $f(\cdot)$.

Lemma 4.2. For all $t \geq 0$ and all $x \in \mathbb{E}$,
$$(4.9) \qquad \psi_t(x) \leq A_t f(x) + \frac{1}{p+\alpha}\|x - x_0\|^{p+\alpha}.$$
Proof. We prove this result by induction on $t$.
Since $A_0 = 0$ and $\psi_0(x) = \frac{1}{p+\alpha}\|x - x_0\|^{p+\alpha}$, (4.9) is true for $t = 0$. Suppose that (4.9) is true for some $t \geq 0$. Then (4.8) and the convexity of $f$ imply that, for all $x \in \mathbb{E}$,
$$\psi_{t+1}(x) = \psi_t(x) + a_t\left[f(x_{t+1}) + \langle \nabla f(x_{t+1}), x - x_{t+1}\rangle\right] \leq \psi_t(x) + a_t f(x) \leq A_{t+1} f(x) + \frac{1}{p+\alpha}\|x - x_0\|^{p+\alpha}.$$
Thus, (4.9) is also true for $t + 1$, and the proof is complete.

The theorem below establishes the global rate of convergence of Algorithm 3.

Theorem 4.3. Assume that H1 is true and let the sequence $\{x_t\}_{t=0}^{T}$ be generated by Algorithm 3. Then, for $t = 2, \dots, T$, the rate of convergence (4.10) holds.

Proof. Let us first prove by induction that
$$(4.11) \qquad A_t f(x_t) \leq \min_{x \in \mathbb{E}} \psi_t(x), \quad t \geq 0.$$
Since $A_0 = 0$, we have $A_0 f(x_0) = 0 = \min_{x \in \mathbb{E}} \psi_0(x)$. Thus, (4.11) is true for $t = 0$. Assume that it is true for some $t \geq 0$, and write $\psi_{t+1}(x) = \psi_t(x) + a_t \ell_t(x)$, where $\ell_t(x) = f(x_{t+1}) + \langle \nabla f(x_{t+1}), x - x_{t+1} \rangle$ is a linear function. Moreover, by Lemma 4 in [14], the function $\frac{1}{p+\alpha}\|x - x_0\|^{p+\alpha}$ is uniformly convex of degree $p+\alpha$ with parameter $2^{-(p+\alpha-2)}$. Thus, $\psi_t(\cdot)$ is also uniformly convex of degree $p+\alpha$ with parameter $2^{-(p+\alpha-2)}$. Therefore, Lemma A.2 in [9] and the induction assumption yield a lower bound on $\psi_{t+1}(x)$ consisting of $A_t f(x_t)$, a uniformly convex term centered at $v_t$, and $a_t \ell_t(x)$. Since $f$ is convex and differentiable, we have $f(x_t) \geq f(x_{t+1}) + \langle \nabla f(x_{t+1}), x_t - x_{t+1} \rangle$. Substituting this inequality into the lower bound, using $A_{t+1} x_{t+1} = A_t x_{t+1} + a_t x_{t+1}$ and (4.6), we conclude that, to prove that (4.11) is true for $t+1$, it is enough to show that inequality (4.12) holds for all $x \in \mathbb{E}$. Using Lemma 2 in [14] with $r = p + \nu$, $s = a_t \nabla f(x_{t+1})$, and $\omega = 2^{-(p+\alpha-1)}$, we see that a sufficient condition for (4.12) follows from the choice of $a_t$ in (4.3).

Thus (4.11) is true for t + 1, completing the induction argument.
Let us now estimate the growth of the coefficients $A_t$. Since $M_t \leq M_\nu$ for all $t$, inequality (4.13) holds, and consequently (4.14) follows. Now, denoting $B_t = \bar{M} A_t$ for all $t \geq 0$, it follows from (4.14) that the sequence $\{B_t\}$ satisfies the recurrence handled by Lemma A.4 in [9]. Note that $A_1 \geq \frac{1}{2\bar{M}}$. Thus $B_1 \geq 1$ and, consequently, for all $t \geq 2$ the lower bound (4.15) on $A_t$ holds. Finally, by (4.11) and Lemma 4.2, for $t \geq 0$ we have
$$f(x_t) - f(x^*) \leq \frac{\|x^* - x_0\|^{p+\alpha}}{(p+\alpha)\, A_t},$$
and (4.10) follows immediately from (4.13) and (4.15).
If we assume that $\nu$ and $H_{f,p}(\nu)$ are known, then, by Lemma A.6, we can set $M_t$ equal to a suitable fixed multiple of $H_{f,p}(\nu)$ for all $t$. Here, by (3.1), we have $\alpha = \nu$, and the corresponding version of Algorithm 3 takes at most $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ iterations to reduce the functional residual below $\epsilon$. When $H_{f,p}(\nu)$ is not known, let us consider the following adaptive version of Algorithm 3.
Algorithm 4. Adaptive accelerated tensor scheme.
Step 0. Choose $x_0 \in \mathbb{E}$, $H_0 > 0$, and $\theta \geq 0$. Set $v_0 = x_0$, $A_0 = 0$, $\psi_0(x) = \frac{1}{p+\alpha}\|x - x_0\|^{p+\alpha}$, and $t := 0$.
Step 1. Set $i := 0$.
Step 1.1. Compute the coefficient $a_{t,i} > 0$ by solving equation (4.3) with scaling coefficient $2^i H_t$.
Step 1.2. Set $y_{t,i} = \frac{A_t x_t + a_{t,i} v_t}{A_t + a_{t,i}}$.
Step 1.3. Compute an approximate solution $x^+_{t,i}$ to $\min_{x \in \mathbb{E}} \Omega^{(\alpha)}_{y_{t,i},p,2^i H_t}(x)$ such that
$$\|\nabla \Omega^{(\alpha)}_{y_{t,i},p,2^i H_t}(x^+_{t,i})\|_* \leq \theta \|x^+_{t,i} - y_{t,i}\|^{p+\alpha-1}.$$
Step 1.4. If the descent condition holds, set $i_t := i$ and go to Step 2. Otherwise, set $i := i + 1$ and go back to Step 1.1.
Step 2. Set $x_{t+1} = x^+_{t,i_t}$, $A_{t+1} = A_t + a_{t,i_t}$, and $H_{t+1} = 2^{i_t - 1} H_t$.
Step 3. Update $\psi_{t+1}$ by (4.8) and compute $v_{t+1} = \arg\min_{x \in \mathbb{E}} \psi_{t+1}(x)$.
Step 4. Set $t := t + 1$ and go back to Step 1.

Note that Algorithm 4 is a particular case of Algorithm 3 in which $M_t = 2^{i_t} H_t$ for all $t \geq 0$.
Let us now define, analogously to (3.13), a function $\tilde{H}(\epsilon)$ of the target accuracy $\epsilon > 0$ that will bound the scaling coefficients of Algorithm 4.

Lemma 4.4. Suppose that H1 is true. Given $\epsilon > 0$, assume that $\{x_t\}_{t=0}^{T}$ is a sequence generated by Algorithm 4 such that $f(x^+_{t,i}) - f(x^*) > \epsilon$ for $i = 0, \dots, i_t$ and $t = 0, \dots, T$. Then the upper bound (4.19) on the scaling coefficients $H_t$ holds.

Moreover, the number $\mathcal{O}_T$ of calls of the oracle after $T$ iterations is bounded as in (4.20).
Proof. Let us prove by induction that the scaling coefficients $H_t$ in Algorithm 4 satisfy (4.19). This is obvious for $t = 0$. Assume that (4.19) is true for some $t \geq 0$. If $\alpha = \nu$, it follows from Lemma A.6 that the final value of $2^{i_t} H_t$ cannot exceed twice the threshold in (4.19), since otherwise the line search would have stopped earlier; that is, (4.19) holds for $t + 1$. On the other hand, suppose that $\alpha = 1$. In view of Lemma A.5 in [9], the gradients at the trial points $x^+_{t,i}$ can be controlled and, by Lemma A.7, the final value of $2^{i_t} H_t$ again cannot exceed twice the threshold in (4.19). Consequently, (4.19) also holds for $t + 1$, which completes the induction. Finally, since the oracle is called $i_t + 1$ times at iteration $t$, the bound (4.20) follows. Combining Theorem 4.3 and Lemma 4.4, we conclude that Algorithm 4 takes at most the number of iterations given in (4.22) if $\nu$ is known (i.e., $\alpha = \nu$), and in (4.23) if $\nu$ is unknown (i.e., $\alpha = 1$).
When $\nu = 1$, bounds (4.22) and (4.23) have the same dependence on $\epsilon$. However, when $\nu \neq 1$, the bound of $\mathcal{O}(\epsilon^{-p/[(p+1)(p+\nu-1)]})$ obtained for the universal scheme (i.e., Algorithm 4 with $\alpha = 1$) is worse than the bound of $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ obtained for the nonuniversal scheme ($\alpha = \nu$). For high-order methods ($p \geq 2$), to the best of our knowledge, there is no simple procedure by which one can identify the level of smoothness $\nu$ of the $p$th derivatives in general. Therefore, despite this gap in the complexity bounds, we believe that the automatic choice of the best function subclass in the universal scheme is a very attractive feature. Moreover, in the nonuniversal scheme, for any $\theta > 0$ with $(p+\nu-1)(H_{f,p}(\nu) + \theta(p-1)!) > H_0$, the corresponding right-hand side of (4.22) has an additional term in comparison to its value when $\theta = 0$. In contrast, in the accelerated universal scheme, for any $\theta$ in a suitable interval $[0, \bar{\theta}]$, the corresponding right-hand side in (4.23) is the same as for $\theta = 0$. In this sense, the accelerated universal scheme appears to be more robust than the accelerated nonuniversal scheme with respect to the inexact solution of the auxiliary problems.
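The size of this gap is easy to quantify; the following snippet (ours) prints the exponents of $1/\epsilon$ in both accelerated bounds for a few values of $(p, \nu)$:

```python
# Exponents of 1/eps in the accelerated complexity bounds discussed above:
for p in (2, 3):
    for nu in (0.0, 0.5, 1.0):
        nonuniversal = 1.0 / (p + nu)               # O(eps^(-1/(p+nu)))
        universal = p / ((p + 1) * (p + nu - 1))    # O(eps^(-p/[(p+1)(p+nu-1)]))
        print(f"p={p}, nu={nu}: nonuniversal={nonuniversal:.4f}, universal={universal:.4f}")
# For nu = 1 both exponents equal 1/(p+1); for nu < 1 the universal exponent
# is strictly larger, i.e., the universal bound is worse, as discussed above.
```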

5. Lower complexity bounds under the Hölder condition.
In this section we investigate how much the convergence rates of our tensor methods can be improved for problems satisfying H1. Specifically, we derive lower complexity bounds for $p$-order tensor methods applied to problem (2.1), where the objective $f$ is convex and $H_{f,p}(\nu) < +\infty$ for some $\nu \in [0,1]$.

5.1. Hard Functions and Lower Complexity Bounds.
For simplicity, let us consider $\mathbb{E} = \mathbb{R}^n$ and $B = I_n$. Given an approximation $\bar{x}$ to the solution of (2.1), $p$-order methods usually compute the next test point as $x^+ = \bar{x} + \bar{h}$, where the search direction $\bar{h}$ is a solution of an auxiliary problem of the form
$$(5.1) \qquad \min_{h \in \mathbb{R}^n} \ \phi_{a,\gamma,m}(h) = \sum_{i=1}^{p} a_i D^i f(\bar{x})[h]^i + \gamma \|h\|^m,$$
with $a \in \mathbb{R}^p$, $\gamma > 0$, and $m > 1$. Denote by $\Gamma_{\bar{x},f}(a, \gamma, m)$ the set of all stationary points of the function $\phi_{a,\gamma,m}(\cdot)$, and define the linear subspace
$$(5.2) \qquad S_f(\bar{x}) = \operatorname{Lin}\left( \Gamma_{\bar{x},f}(a, \gamma, m) \ | \ a \in \mathbb{R}^p, \ \gamma > 0, \ m > 1 \right).$$
With this notation, we can characterize the class of p-order tensor methods by the following assumption.
Assumption 1. Given $x_0 \in \mathbb{R}^n$, the method generates a sequence of test points $\{x_k\}_{k \geq 0}$ such that
$$x_{k+1} \in x_0 + \sum_{i=0}^{k} S_f(x_i), \quad k \geq 0.$$
Given $\nu \in [0,1]$, our parametric family of difficult functions for $p$-order tensor methods is built from the separable power function $\eta_{p+\nu}(y) = \frac{1}{p+\nu}\sum_{i=1}^{n}|y^{(i)}|^{p+\nu}$ composed with suitable linear operators $A_k$; the precise definition of $f_k(\cdot)$ is given in (5.4). The next lemma establishes that for each $f_k(\cdot)$ we have $H_{f_k,p}(\nu) < +\infty$.
Lemma 5.1. Given an integer $k \in [2, n]$, the $p$th derivative of $f_k(\cdot)$ is $\nu$-Hölder continuous; that is, $H_{f_k,p}(\nu) < +\infty$, with an explicit upper bound on the Hölder constant.

Proof. In view of (5.4), the $p$th derivative of $f_k(\cdot)$ is determined by the derivatives of the power function $\eta_{p+\nu}(\cdot)$ composed with the linear operator $A_k$. It can be shown (see page 13 in [15]) that the $p$th derivative of the univariate function $\tau \mapsto |\tau|^{p+\nu}$ is $\nu$-Hölder continuous. On the other hand, for any $x, h \in \mathbb{R}^n$, the $p$th directional derivative of $f_k$ along $h$ decomposes into a sum of such univariate terms evaluated at the coordinates of $A_k x$ and $A_k h$. Therefore, for all $x, y, h \in \mathbb{R}^n$, the difference $|D^p f_k(x)[h]^p - D^p f_k(y)[h]^p|$ is bounded by a sum of coordinatewise Hölder terms and, by (5.9), by a multiple of $\|x - y\|^{\nu}\|h\|^{p}$. Thus, combining (5.10)-(5.12), we get the claimed bound on $H_{f_k,p}(\nu)$.

The next lemma provides additional properties of $f_k(\cdot)$.

Lemma 5.2. Given an integer $k \in [2, n]$, let the function $f_k(\cdot)$ be defined by (5.4). Then $f_k(\cdot)$ has a unique global minimizer $x^*_k$, whose coordinates and optimal value $f_k(x^*_k)$ admit explicit expressions.

Proof. The existence and uniqueness of $x^*_k$ follow from the fact that $f_k(\cdot)$ is uniformly convex. In view of (5.6), it follows from the first-order optimality condition that $A_k x^*_k = y^*_k$, where $y^*_k$ satisfies
$$(5.14) \qquad \nabla \eta_{p+\nu}(y^*_k) = A_k^{T} e_1 = \hat{e}_k = \begin{pmatrix} \bar{e}_k \\ 0_{n-k} \end{pmatrix},$$
with $\bar{e}_k \in \mathbb{R}^k$ being the vector of all ones and $0_{n-k}$ the origin of $\mathbb{R}^{n-k}$. Note that
$$(5.15) \qquad \frac{\partial \eta_{p+\nu}}{\partial y^{(i)}}(y) = |y^{(i)}|^{p+\nu-2}\, y^{(i)}, \quad i = 1, \dots, n.$$
Consequently, (5.14) is equivalent to
$$|y_k^{*(i)}|^{p+\nu-2}\, y_k^{*(i)} = \begin{cases} 1, & \text{for } i = 1, \dots, k, \\ 0, & \text{for } i = k+1, \dots, n, \end{cases}$$
that is, $y^*_k = \hat{e}_k$. Solving $A_k x^*_k = \hat{e}_k$ then gives the coordinates of $x^*_k$ in (5.16), expressed in terms of $(\tau)_+ = \max\{0, \tau\}$, together with (5.17). Finally, combining (5.6), (5.7), (5.16), and (5.17), we obtain the expressions for $x^*_k$ and $f_k(x^*_k)$ claimed in the lemma.

Our goal is to understand the behavior of the tensor methods specified by Assumption 1 when applied to the minimization of $f_k(\cdot)$ with a suitable $k$. For that, let us consider the subspaces
$$\mathbb{R}^n_k = \left\{ x \in \mathbb{R}^n : x^{(i)} = 0, \ i = k+1, \dots, n \right\}, \quad k = 1, \dots, n-1.$$

Lemma 5.3. For any $q \geq 0$ and $x \in \mathbb{R}^n_k$, we have $f_{k+q}(x) = f_k(x)$.

Proof. The claim follows directly from (5.4).

Lemma 5.4. Let $\mathcal{M}$ be a $p$-order tensor method satisfying Assumption 1. If $\mathcal{M}$ is applied to the minimization of $f_t(\cdot)$ starting from $x_0 = 0$, then the sequence $\{x_k\}_{k \geq 0}$ of test points generated by $\mathcal{M}$ satisfies $x_k \in \mathbb{R}^n_k$ for all $k \geq 1$.

Proof. See Lemma 2 in [15].
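For concreteness, the separable power function $\eta_{p+\nu}$ and its gradient (5.15) are straightforward to implement; the following check (ours, with the arbitrary choice $p = 3$, $\nu = 1/2$) verifies (5.15) by finite differences:

```python
import numpy as np

# eta_{p+nu}(y) = (1/(p+nu)) * sum_i |y_i|^(p+nu), with partial derivatives
# d eta / d y_i = |y_i|^(p+nu-2) * y_i, as in (5.15).
def eta(y, p, nu):
    return np.sum(np.abs(y) ** (p + nu)) / (p + nu)

def grad_eta(y, p, nu):
    return np.abs(y) ** (p + nu - 2) * y

p, nu, h = 3, 0.5, 1e-6
y = np.random.default_rng(1).standard_normal(5)
fd = np.array([(eta(y + h * e, p, nu) - eta(y - h * e, p, nu)) / (2 * h)
               for e in np.eye(5)])
assert np.allclose(fd, grad_eta(y, p, nu), atol=1e-5)
```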
Now we can prove the lower complexity bound for $p$-order tensor methods applied to the minimization of functions with $\nu$-Hölder continuous $p$th derivatives.

Theorem 5.5. Let $\mathcal{M}$ be a $p$-order tensor method satisfying Assumption 1. Assume that, for any function $f$ with $H_{f,p}(\nu) < +\infty$, this method ensures the rate of convergence (5.18), where $\{x_k\}_{k \geq 0}$ is the sequence generated by method $\mathcal{M}$ and $x^*$ is a global minimizer of $f$. Then, for all $t \geq 1$ such that $2t + 1 \leq n$, the lower bound (5.19) holds.

Proof. Let us apply method $\mathcal{M}$ to the minimization of the function $f_{2t+1}(\cdot)$ starting from the point $x_0 = 0$. By Lemma 5.4 we have $x_i \in \mathbb{R}^n_t$ for all $i$, $0 \leq i \leq t$. Moreover, by Lemma 5.3 we have $f_{2t+1}(x_i) = f_t(x_i)$ for these iterates. Thus, from (5.18), (5.21), Lemma 5.1, and Lemma 5.2 we get (5.19), where the constant $C_{p,\nu}$ is given by (5.20); for $\nu = 1$, this matches the bound in [12].

The lower bound of Theorem 5.5 corresponds to a worst-case complexity of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ iterations necessary to ensure $f(x_k) - f(x^*) \leq \epsilon$. This means that the nonuniversal accelerated schemes proposed in this paper (e.g., Algorithm 4 with $\alpha = \nu$) are nearly optimal tensor methods. In fact, for $\epsilon \in (0,1)$, note that
$$\left(\frac{1}{\epsilon}\right)^{\frac{1}{p+\nu}} = \left(\frac{1}{\epsilon}\right)^{\frac{2}{3(p+\nu)-2}} \left(\frac{1}{\epsilon}\right)^{\frac{p+\nu-2}{(p+\nu)(3(p+\nu)-2)}},$$
so the gap between the upper bound of the nonuniversal accelerated scheme and this lower bound is only of order $\epsilon^{-(p+\nu-2)/[(p+\nu)(3(p+\nu)-2)]}$. Notice that the lower bound described in Theorem 5.5 is valid only while the iteration counter $t$ satisfies $t < \frac{1}{2}(n-1)$, where $n$ is the dimension of the domain of the objective. The same condition on $t$ appears in other lower bounds in the literature for the case $p = 1$ and $\nu = 1$ (see, e.g., Theorem 2.1.7 in [16]).
6. Conclusion. In this paper, we presented $p$-order methods for unconstrained minimization of convex functions that are $p$-times differentiable with $\nu$-Hölder continuous $p$th derivatives. For the universal and the nonuniversal schemes without acceleration, we established iteration complexity bounds of $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ for reducing the functional residual below a given $\epsilon \in (0,1)$. Assuming that $\nu$ is known, we obtained an improved complexity bound of $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ for the corresponding accelerated scheme. For the case in which $\nu$ is unknown, we presented an accelerated universal tensor scheme with an iteration complexity of $\mathcal{O}(\epsilon^{-p/[(p+1)(p+\nu-1)]})$.
Finally, a lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ was also obtained for this problem class. This means that our accelerated nonuniversal schemes are nearly optimal. Remarkably, the complexity bound obtained for the accelerated universal scheme is slightly worse than the bound obtained for the nonuniversal accelerated schemes. Up to now, it is not clear whether the estimating-sequences technique can be modified to provide an accelerated universal $p$-order method with a complexity bound of $\mathcal{O}(\epsilon^{-1/(p+\nu)})$.
It is worth mentioning that the study of high-order methods is still in its early stages, with the majority of recent works in this area focusing on the derivation of global complexity bounds (see, e.g., [2, 4, 6, 7, 11, 15]). These bounds predict that high-order methods with $p \geq 3$ may require significantly fewer calls of the oracle than second-order methods. As pointed out in [7, 15], the computation of high-order derivatives may be affordable for structured objectives (such as separable functions). Moreover, at least for $p = 3$ and $\alpha = \nu$, the auxiliary problems can be solved by Bregman gradient methods that also exploit their particular structure [10, 15]. Nevertheless, the practical impact of high-order methods is yet to be seen.