Image Labeling Based on Graphical Models Using Wasserstein Messages and Geometric Assignment

We introduce a novel approach to Maximum A Posteriori inference based on discrete graphical models. By utilizing local Wasserstein distances for coupling assignment measures across edges of the underlying graph, a given discrete objective function is smoothly approximated and restricted to the assignment manifold. A corresponding multiplicative update scheme combines in a single process (i) geometric integration of the resulting Riemannian gradient flow and (ii) rounding to integral solutions that represent valid labelings. Throughout this process, local marginalization constraints known from the established LP relaxation are satisfied, whereas the smooth geometric setting results in rapidly converging iterations that can be carried out in parallel for every edge.

Let Ω ⊂ R 2 be a domain where image data are observed, and let G = (V, E), |V| = m, denote a grid graph embedded into Ω.Each vertex i ∈ V indexes the location of a pixel, to which a random variable x i ∈ X = { 1 , . . ., n } (1.1) is assigned which takes values in a finite set X of labels.The image labeling problem is the task to assign to each x i a label such that the discrete objective function is minimized.This function comprises for each pixel i ∈ V local energy terms E i (x i ) that evaluate local label predictions for each possible value of x i ∈ X .In addition, E(x) comprises for each edge ij ∈ E local distance functions E ij (x i , x j ) that evaluate the joint assignment of labels to x i and x j .If the local energy functions E ij (x i , x j ) = d(x i , x j ) are defined by a metric d : X × X → R, then (1.2) is called the metric labeling problem [KT02].In general, the presence of these latter terms makes image labeling a combinatorially hard task.We refer to [KAH + 15] for a recent survey on the image labeling problem and on algorithms for solving either approximately or exactly problem (1.2).
A major class of algorithms for approximately solving (1.2) is based on the linear (programming) relaxation [Wer07] (see Section 2.2 for details) min µ∈L G θ, µ .
(1.3) Solving the linear program (LP) (1.3) returns a globally optimal relaxed indicator vector µ whose components take values in [0, 1].If µ is a binary vector, then it corresponds to a solution of problem (1.2).In realistic applications, this is not the case, however, and the relaxed solution µ has to be rounded to an integral solution in a post-processing step.
In this paper, we present an alternative inference algorithm that deviates from the traditional two-step process: convex relaxation and rounding.It is based on the recently proposed geometric approach [ ÅPSS17] to image labeling.The basic idea underlying this approach is to restrict indicator vector fields to the relative interior of the probability simplex, equipped with the Fisher-Rao metric, and to regularize label assignments by iteratively computing Riemannian means (see Section 3 for details).This results in a highly parallel, multiplicative update scheme, that rapidly converges to an integral solution.Because this model of label assignment does not interfere with data representation, the approach applies to any data given in a metric space.The recent paper [BFPS17] reports the application of our scheme to a range of challenging labeling problems of manifold-valued data.
Adopting this starting point, the objectives of the present paper are: • Show how the approach [ ÅPSS17] can be used to efficiently compute high-quality (low-energy) solution for an arbitrary given instance of the labeling problem (1.2). • Devise a novel labeling algorithm that tightly integrates both relaxation and rounding to an integral solution in a single process.• Stick to the smooth geometric model suggested by [ ÅPSS17] so as to overcome the inherent nonsmoothness of convex polyhedral relaxations and the slow convergence of corresponding first-order iterative methods of convex programming.Regarding the last point, a key ingredient of our approach is a smooth approximation of problem (1.3), where d θ ij ,τ denotes the local smoothed Wasserstein distance between the discrete label assignment measures µ i , µ j coupled along the edge ij of the underlying graph.Besides achieving the degree of smoothness required for our geometric setting, this approximation also properly takes into account the regularization parameters that are specified in terms of the local energy terms E ij of the labeling problem (1.2).Our approach restricts the function E τ to the so-called assignment manifold and iteratively determines a labeling by tightly combining geometric optimization with rounding to an integral solution in a smooth fashion.
1.2.Related Work.Problem sizes of linear program (LP) (1.3) are large in typical applications of image labeling, which rules out the use of standard LP codes.In particular, the theoretically and practically most efficient interior point methods based on self-concordant barrier functions [NN87,Ren95] are infeasible due to the dense linear algebra steps required to determine search and update directions.Therefore, the need for dedicated solvers for the LP relaxation (1.3) has stimulated a lot of research.A prominent example constitute subclasses of objective functions (1.2) as studied in [KZ04], in particular binary submodular functions, that enable to reformulate the labeling problem as maximum-flow problem in an associated network and the application of discrete combinatorial solvers [BVZ01,BK04].
Since the structure of such algorithms inherently limits fine-grained parallel implementations, however, belief-propagation and variants [YFW05] have been popular among practitioners.These fixed point schemes in terms of dual variables iteratively enforce the so-called local polytope constraints that define the feasible set of the LP relaxation (1.3).They can be efficiently implemented using 'message passing' and exploit the structure of the underlying graph.Although convergence is not guaranteed on cyclic graphs, the performance in practice may be good [YMW06].The theoretical deficiencies of basic belief propagation in turn stimulated research on convergent message passing schemes, either using heuristic damping or utilizing in a more principled way convexity.Prominent examples of the latter case are [WJW05,HS10].We refer to [KAH + 15] for many more references and a comprehensive experimental evaluation of a broad range of algorithms for image labeling.
The feasible set of the relaxation (1.3) is a superset of the original feasible set of (1.2).Therefore, globally optimal solutions to (1.3) generally do not constitute valid labelings but comprise non-integral components µ i (x i ) ∈ (0, 1), x i ∈ X , i ∈ V. Randomized rounding schemes for converting a relaxed solution vector µ to a valid labeling x ∈ X m , along with suboptimality bounds, were studied in [KT02,CKNZ05].The problem to infer components x * i of the unknown globally optimal combinatorial labeling that minimizes (1.2), through partial optimality and persistency, was studied in [SSK + 16].We refer to [Wer07] for the history and more information about the LP relaxation of labeling problems, and to [WJ08] for connections to discrete probabilistic graphical models from the variational viewpoint.
Optimal transport and the Wasserstein distance have become a major tool of signal modeling and analysis [KPT + 17].Regarding the finite-dimensional formulation in terms of linear programs, the design of efficient algorithms for large-scale problems requires sophisticated techniques [Sch16].The problems of discrete optimal transport studied in this paper, in connection with the local Wasserstein distances of (1.4), have a small or moderate size (n 2 : number of labels squared).We apply the standard device of enhancing convexity through entropic regularization, which increases smoothness in the dual domain.We refer to [Sch90] and [Bru06, Ch. 9] for basic related work and the connection to matrix scaling algorithms and the history.Smoothing of the Wasserstein distance and Sinkhorn's algorithm has become popular in machine learning due to [Cut13a].The authors of [Pey15,CP16] comprehensively investigated barycenters and interpolation based on the Wasserstein distance.Our approach to image labeling, in conjunction with the geometric approach of [ ÅPSS17], is novel and elaborates the preliminary announcement [ ÅHS + 17].
1.3.Contribution and Organization.We collect basic notation, background material and details of the LP relaxation (1.3) in Section 2. Section 3 summarizes the basic concepts of the geometric labeling approach of [ ÅPSS17], in particular the so-called assignment manifold, and the general framework of [SH Å+ 17] for numerically integrating Riemannian gradient flows of functionals defined on the assignment manifold.This section provides the basis for the two subsequent sections that contain our main contribution.
Section 4 studies the approximation (1.4) and provides explicit expressions for the Riemannian gradient the restriction of E τ to the assignment manifold.A key property of this set-up concerns the local polytope constraints that define the feasible set L G of the LP relaxation (1.3): by construction, they are always satisfied throughout the resulting iterative process of label assignment.Thus, our formulation is both more tightly constrained and smooth, in contrast to the established convex programming approaches based on (1.3).
Section 5 details the combination of all ingredients into a single, smooth, geometric approach that performs simultaneously minimization of the objective function (1.4) and rounding to an integral solution (label assignment).This tight integration is a second major property that distinguishes our approach from related work.Section 5 also explains the notion 'Wasserstein distances' in the title of this paper due to the dual variables that are numerically utilized to evaluate gradients of local Wasserstein distances, akin to how dual (multiplier) variables in basic belief propagation schemes are used to enforce local marginalization constraints.Unlike the latter computations they have the structure of message passing on a dataflow architecture, however, message passing induced by our approach is fully parallel along all edges of the underlying graph and hence resembles the structure of numerical solvers for PDEs.
The remaining two sections are devoted to numerical evaluations of our approach.To keep this paper at a reasonable length, we merely consider the most elementary iterative update scheme, based on the geometric integration of the Riemannian gradient flow with the (geometric) explicit Euler scheme.The potential of the framework outlined by [SH Å+ 17] for more sophisticated numerical schemes will be explored elsewhere.Furthermore, working out any realistic application is beyond the scope of this paper.Rather, the experimental results are supposed to demonstrate major properties of our approach.
Section 6 provides all details of our implementation that are required to reproduce our computational results.Section 7 reports and discussed the results of four types of experiments: (1) The interplay between two parameters τ and α that control smoothness of the approximation (1.4) and rounding, respectively, is studied.In order to miminize efficiently (1.2), the Riemannian flow with respect to the smooth approximation (1.4) must reveal proper descent directions.This imposes an upper bound on the smoothing parameter τ .Naturally, the effect of rounding has to be stronger to make the iterative process converge to an integral solution.A corresponding choice of α controls the compromise between quality of integral labelings in terms of the energy (1.4) and speed of convergence.Fortunately, the upper bound on τ is large enough to achieve attractive convergence rates.
(2) We comprehensively explore numerically the entire model space of the minimal binary graphical model on the cyclic triangle graph K 3 , whose relaxation in terms of the so-called local polytope already constitutes a superset of the marginal polytope as admissible set for valid integral labelings.
In this way, we explore the performance of our approach in view of the LP relaxation and established inference based on convex programming, and with respect to the (generally intractable) feasible set of integral solutions.Corresponding phase diagrams display and support quantitatively the trade-off between accuracy of optimization and rate of convergence through the choice of the single parameter α.
(3) A labeling problem of the usual size was conducted to confirm and demonstrate that the finding of the preceding points for 'all' models on K 3 also hold in a typical application.A comparison to sequential tree-reweighted message passing (TRWS) [Kol06] which defines the state of the art, and to loopy belief propagation (BP) based on the OpenGM package [ABK12], shows that our approach is on par with these methods regarding the energy level E(x) of the resulting labeling x. (4) A final experiment based on the graphical model with a pronounced non-uniform (non-Potts) prior demonstrates that our approach is able to perform inference for any given graphical model.We conclude in Section 8 and relegated to an Appendix a series of proofs that otherwise would interrupt too much the line of reasoning.

PRELIMINARIES
We introduce basic notation in Section 2.1 and the common linear programming (LP) relaxation of the labeling problem in Section 2.2.In order to clearly distinguish between the LP relaxation and our geometric approach to the labeling problem based on [ ÅPSS17] (see Section 3.1), we keep the standard notation in the literature for the former approach and the notation from [ ÅPSS17] for the latter one.Remark 3.1 below identifies variables of both approaches that play a similar role.
2.1.Basic Notation.For an undirected graph G = (V, E), the adjacency relation i ∼ j means that vertices i and j are connected by an undirected edge ij ∈ E, where the latter denotes the unordered pair {i, j} = ij = ji.The neighbors of vertex i form the set of all vertices adjacent to i, and its cardinality d(i) = |N (i)| is the degree of i. G is turned into a directed graph by assigning an orientation to every edge ij, which then form ordered pairs denoted by (i, j) = ij = ji = (j, i).We only consider graphs without multiple edges between any pair of nodes i, j ∈ V.
We use the abbreviation [n] = {1, 2, . . ., n} for n ∈ N. R = R ∪ {+∞} denotes the extended real line.All vectors are regarded as column vectors, and x denotes transposition of a vector x.We ignore transposition however when vectors are explicitly specified by their components; e.g.we write x = (y, z) instead of the more cumbersome x = (y , z ) .We set 1 n = (1, 1, . . ., 1) ∈ N n and write 1 if n is clear from the context.x, y = i∈ [n] x i y i denotes the Euclidean inner product.Given a matrix we denote the row vectors by A i , i ∈ [m] and the column vectors by A j , j ∈ [n].The canonical matrix inner product is A, B = tr(A B), where tr denotes the trace of a matrix, i.e. tr(A B) The set of nonnegative vectors x ∈ R n is denoted by R n + and the set of strictly positive vectors by R n ++ .The probability simplex ∆ n = {p ∈ R n + : 1 n , p = 1} contains all discrete distributions on [n].A doubly stochastic matrix µ ij ∈ R n×n + , also called coupling measure in this paper in connection with discrete optimal transport, has the property: µ ij 1 n ∈ ∆ n and µ ij 1 n ∈ ∆ n .We denote these two marginal distributions of µ ij by µ i and µ j , respectively, and the linear mapping for extracting them by Its transpose is given by (2.3b) The kernel (nullspace) of a linear mapping A is denoted by N (A) and its range by R(A).
The functions exp, log apply componentwise to strictly positive vectors x ∈ R n ++ , e.g. e x = (e x 1 , . . ., e xn ), and similarly for strictly positive matrices.Likewise, if x, y ∈ R n ++ , then we simply write for the componentwise multiplication and subdivision.We define F 0 to be the class of proper, lower-semicontinuous and convex functions defined on R n .For any function f ∈ F 0 , ∂f (x) denotes its subdifferential at x, and the conjugate function f * ∈ F 0 of f is given by the Legendre-Fenchel transform (cf.[RW09, Section 11.A]) (2.5) For a given closed convex set C, its indicator function is denoted by (2.8b) We will use the following basic result from convex analysis (cf., e.g.[RW09, Ch. 11]), where ∂f (x) denotes the subdifferential of a function f ∈ F 0 at x.
Theorem 2.1 (inversion rule for subgradients).Let f ∈ F 0 .Then (2.9b) We will also apply the following classical theorem of Danskin and its extension by Rockafellar.
Theorem 2.2 ([Dan66, Roc91]).Let f (z) = max w∈W g(z, w), where W is compact and the function g is differentiable and ∇ z g(z, w) is continuously depending on (z, w).If in addition g(z, w) is convex in z, and if z is a point such that arg max w∈W g(z, w) = {w}, then f is differentiable at z with which merely encode the values of the discrete objective function (1.2): ).These local terms are commonly called unary and pairwise terms in the literature.Recall from the discussion of (1.2) that the unary terms represent the data and the pairwise terms specify a regularizer.All these local terms are indexed by the vertices i ∈ V and edges ij ∈ E of the underlying graph G = (V, E) and assembled into the vectors where we conveniently regard θ ij ∈ R n 2 either as local vector or as local matrix θ ij ∈ R n×n , depending on the context.Next we define local indicator vectors indexed in the same way as (2.11) and assembled into the vectors µ = (µ V , µ E ), µ V := (. . ., µ i , . . .), i ∈ V, µ E := (. . ., µ ij , . . .), ij ∈ E. (2.14) The combinatorial optimization problem (1.2) now reads min µ θ, µ .The corresponding linear programming relaxation consists in replacing the discrete feasible set of (2.13) by the convex polyhedral sets As a result, the linear programming relaxation (1.3) of (1.2) reads more explicitly where the so-called local polytope L G is the set of all vectors µ of the form (2.14) with components ranging over the sets specified by (2.15).The adjective "local" refers to the local marginalization constraints (2.15b).

IMAGE LABELING ON THE ASSIGNMENT MANIFOLD
This section sets the stage for our approach to solving approximately the labeling problem (1.2).We first introduce in Section 3.1 in terms of the assignment manifold the setting for the smooth approach to image labeling [ ÅPSS17], to be sketched in Section 3.2.Section 3.3 summarizes the general framework of [SH Å+ 17] for numerically integrating Riemannian gradient flows of functionals defined on the assignment manifold.
3.1.The Assignment Manifold.The relative interior of the probability simplex S := rint(∆ n ), given by Due to 1, v = 0 for all v ∈ T , we have the orthogonal decomposition R n = T ⊕ R1.The orthogonal projection onto T is given by where I denotes the (n × n) identity matrix.The manifold S becomes a Riemannian manifold by endowing it with the Fisher-Rao metric.At a point p ∈ S, this metric is given by where all operations are applied componentwise, i.e. u √ p = ( u 1 √ p 1 , . . ., un √ pn ).In this setting, there is an important map, called the lifting map (cf.[ ÅPSS17, Def.4]), defined as By restricting L onto the tangent space, we obtain a diffeomorphism This restricted lifting map L is also a local first order approximation to the exponential map of the Riemannian manifold S (cf [ ÅPSS17, Prop.3]), with the inverse mapping given by The assignment manifold is defined as the product manifold W := i∈[m] S and can be identified with the space W = {W ∈ R m×n ++ : W 1 = 1} of row-stochastic matrices with full support.With the Riemannian product metric, W also becomes a Riemannian manifold with constant tangent space The Fisher-Rao product metric reads The orthogonal decomposition of T induces the orthogonal decomposition together with the orthogonal projection where Π is the projection matrix from (3.2) above.Thus, the projection of a matrix X onto T m is just the projection (3.2) applied to every row of X.The lifting map, the restricted lifting map and its inverse are naturally extended to for every W ∈ W, by applying L : R n → S, L : T → S and L −1 : S → T from (3.4), (3.5), (3.6) to every row, LW (X 3.2.Image Labeling on W. In [ ÅPSS17] the following approach was proposed.Let G = (V, E) be a graph with vertex set V = [m].Suppose a function is given on this graph with values in some feature space F, Furthermore, let the set X = { 1 , . . ., n } from (1.1) denote a set of prototypes or labels (possibly X ⊂ F) and assume a distance function is specified, measuring how well a feature is represented by a certain prototype.We are interested in the assignment of the prototypes to the data in terms of an assignment matrix W ∈ W ⊂ R m×n .The elements of W can be interpreted as the posterior probability that j generated the observation f i .The assignment task of determining an optimal assignment W * can thus be interpreted as finding an 'explanation' of the data in terms of the prototypes X .
Remark 3.1 (W vs. µ).Each row vector W i , i ∈ [m] plays the role of a corresponding vector µ i of the basic LP relaxation as defined by (2.13), with relaxed domain due to (2.15).Unlike µ i , however, vectors W i ∈ R n ++ always have full support and live on the manifold S. The objective function for measuring the quality of an assignment involves three matrices defined next.First, all distance information between observed feature vectors and prototypes (labels) are gathered by the distance matrix ) and then lifted onto the assignment manifold at W ∈ W. By using (3.11) we obtain the likelihood matrix where each row i of L is given by is defined as a local geometric average of assignment vectors at neighboring nodes, i.e. the i-th row S i is defined to be the Riemannian mean (cf.[ ÅPSS17, Def.2]) The correlation between W and the local averages defining S(W ), as measured by the basic matrix inner product, is used as the objective function to be maximized.The optimization strategy is to follow the Riemannian gradient ascent flow on W (see Section (3.3) for the formal definition of the Riemannian gradient) The initialization W i (0) = 1 n 1 n with the barycenter of S constitutes an uninformative uniform assignment which is not biased towards any prototype.
To obtain an efficient numerical algorithm, the Riemannian mean is approximated using the geometric mean Based on the simplifying, plausible assumption that the mean only changes slowly and by using the explicit Euler-method directly on W with a certain adaptive step-size (cf.[ ÅPSS17, chap.3.3]), the following multiplicative update scheme is obtained (3.23) 3.3.Geometric Integration of Gradient Flows.In this section we collect the basic ingredients needed in the remainder of this paper, of a general framework due to [SH Å+ 17] for integrating a Riemannian gradient flow of an arbitrary function J : W → R defined on the assignment manifold.We first recall the definition of the Riemannian gradient.Let M be a Riemannian manifold with an inner product g M on the tangent space and f : M → R a smooth function.Using the identification where Df (x) : Suppose J : W → R is a general smooth objective function modeling an assignment problem and we are interested in minimizing J by following the Riemannian gradient descent flow with the barycenter C = 1 n 1 m 1 n .Instead of directly minimizing J on W, the basic idea of [SH Å+ 17] is to pull the optimization problem back onto the tangent space T m = T C W by setting using the diffeomorphism L C : T m → W given by (3.11).Furthermore, the pullback of the Fisher-Rao metric under L C is used to equip T m with a Riemannian metric and to turn L C into an isometry.In this setting, the Riemannian gradient of J : where ∇J denotes the standard Euclidean gradient of J : W → R. Based on this construction, solving the gradient flow (3.25) is equivalent to where Choosing the explicit Euler method for solving this gradient flow problem on the vector space T m , results in the numerical update scheme with step-size h ∈ R. Lifting this update scheme to the assignment manifold W yields a multiplicative update rule

ENERGY, GRADIENTS AND WASSERSTEIN MESSAGES
In this section we study the smooth objective function (1.4) restricted to the assignment manifold, in order to prepare the application of the approach of Section 3 to graphical models in Section 5.
After detailing the rationale behind (1.4) in Section 4.1, we compute the Euclidean gradient of the objective function in Section 4.2 on which the Riemannian gradient will be based.This gradient involves the gradients of local Wasserstein distances that are considered in Section 4.3.From the viewpoint of belief propagation, these gradients can be considered as 'Wasserstein messages', as discussed in Section 5. 4.1.Smooth Approximation of the LP Relaxation.The starting point (3.16) for applying the labeling approach of Section 3.2 to a given problem is a definition of suitable distances.Regarding problem (1.2) and the corresponding model parameter vector θ defined by (2.12), this is straightforward to do for the unary terms θ i that typically measure a local distance to observed data.But this is less obvious for the pairwise terms θ ij that do not have a direct counterpart in the geometric labeling approach.
The following Lemma explains why the local Wasserstein distances defined for every edge ij ∈ E with Π(µ i , µ j ) due to (2.15b), are natural candidates for taking into account pairwise model parameters θ ij .
Lemma 4.1.The local polytope relaxation (2.16) is equivalent to the problem involving the local Wasserstein distances (4.1).
Proof.The claim follows from reformulating the LP-relaxation based on the local polytope constraints (2.15) as follows.
In order to conform to our smooth geometric setting, we regularize the convex but non-smooth (piecewiselinear) local Wasserstein distances (4.1) with a general convex smoothing function F τ , with smoothing parameter τ .
Remark 4.1 (role of the smoothing).The influence of the smoothing parameter τ will be examined in detail in the remainder of this paper.We wish to point out from the beginning, however, that the ability of our smooth geometric approach to compute integral labeling assignments does not necessarily imply values of τ ≈ 0 close to zero, because the rounding mechanism to integral assignments is a different one, as will be shown in Section 5.As a consequence, larger feasible values of τ weaken the nonlinear relation (4.3) and considerably speed up the convergence of numerical algorithm for iterative label assignment.
Remark 4.2 (local polytope constraints).Using the regularized local Wasserstein distances (4.3) implies by their definition that the local marginalization constraints (2.15) are always satisfied.This is in sharp contrast to alternative labeling schemes, like loopy belief propagation, were these constraints are gradually enforced during the iteration and are guaranteed to hold only after convergence of the entire iteration process.This elucidates two key properties that distinguish the manifold setting of our labeling approach from established work: (i) inherent smoothness and (ii) anytime validity of the local polytope constraints.
Based on Lemma 4.1 and the regularized local Wasserstein distances (4.3), we study in this paper the objective function (1.4), which is a smooth approximation of the local polytope relaxation (2.16) of the original labeling problem (1.2), with the local polytope constraints (2.15) built in.
In order to get an intuition about suitable smoothing functions F τ , we inspect the smoothed local Wasserstein distance (4.1) in more detail.To this end, it will be convenient to simplify temporarily our notation in the remainder of this section by dropping indices as follows.
notation for any edge ij : with the marginal vector µ playing the role of µ i µ j in (2.15).The local (non-smooth) Wasserstein distance (4.1) then reads, for any edge ij ∈ E, Using the linear map A defined by (2.3a), we rewrite expression (4.5) as The corresponding dual LP of (4.6) is given by The smoothed local Wasserstein distance (4.3) is given by for F τ ∈ F 0 and τ > 0, and the dual problem to (4.8) reads Suitable candidates of functions G τ for smoothing d Θ suggest themselves by comparing the dual LP (4.7) with the dual problem (4.9) of the smoothed LP.Rewriting the constraints of (4.7) in the form and comparing with (4.9) shows that G * τ should be a smooth approximation of the indicator function δ R n×n − .We get back to this point in Section 6.2.4.2.Energy Gradient ∇E τ .The pairwise model parameters θ E may not be symmetric, θ ij = θ ij , ij ∈ E, in general, which implies that the smoothed local Wasserstein distances are not symmetric either: In order to compute the Euclidean gradient ∇E τ of the objective function (1.4), we therefore introduce an arbitrary fixed orientation (i, j) (ordered pair) of all edges ij ∈ E, which means ij ∈ E =⇒ ji ∈ E. As a consequence, (1.4) reads (4.12) The following proposition specifies the gradient ∇E τ in terms of an expression that involves local gradients of the smoothed Wasserstein distances d θ ij ,τ .These latter gradients are studied in Section 4.3 (Theorem 4.5).
Proposition 4.2 (objective function gradient).Suppose the edges E have an arbitrary fixed orientation.
Then the Euclidean gradient of the objective function E τ : W → R due to (1.4), at W ∈ W, is the matrix ∇E τ (W ) ∈ T m whose i-th row is given by where ∇ 1 d θ ij ,τ (W i , W j ) ∈ T and ∇ 2 d θ ji ,τ (W j , W i ) ∈ T are the Euclidean gradients of Proof.Appendix A.1.
We now consider after a preparatory Lemma the specific case that all pairwise model parameters θ ij = θ ij are symmetric (Corollary 4.4).Recall definition (2.15b) of the set Π(•, •) of coupling measures having its arguments as marginals and Remark 3.1 regarding notation.
Lemma 4.3.Suppose the convex smoothing function F τ defining the regularized local Wasserstein distances As a consequence of Lemma 4.3, if all pairwise model parameters θ ij are symmetric, in addition to F τ (M ) = F τ (M ) for all M ∈ [0, 1] n×n , then there is no need to choose an edge orientation as was done in connection with (4.12).Rather, using (2.1), we may rewrite (4.12) as and reformulate Proposition 4.2 accordingly.
Corollary 4.4 (objective function gradient: symmetric case).Suppose F τ (T ) = F τ (T ) for all T ∈ [0, 1] n×n and θ ij is symmetric for all ij ∈ E. Then the i-th row of the Euclidean gradient ∇E τ is given by (4.17) Proof.Appendix A.3.

4.3.
Local Wasserstein Distance Gradient.In this section, we check differentiability of the distance functions d θ ij ,τ (µ i , µ j ), ij ∈ E, given by (4.3), and specify an expression for the corresponding gradient.To formulate the main result of this section, we again use the simplified notation (4.4).
Theorem 4.5 (Wasserstein distance gradient).Consider S ⊂ R n as an Euclidean submanifold with tangent space T defined by (3.1), and let denote the dual objective function (4.23).Then the smoothed Wasserstein distance d Θ,τ : S × S → R is differentiable, and the Euclidean gradient of d Θ,τ at p = (p 1 , p 2 ) ∈ S × S is given by where The proof follows below after some preparatory Lemmas.
with the convex smoothing function F τ of eq.(4.3), and assume the conjugate function G * τ is continuously differentiable.Then the dual problem of is given by max Furthermore, assuming that strong duality holds, the conditions for optimal primal M and dual ν = (ν 1 , ν 2 ) solutions are together with the affine constraint The smoothness assumption with respect to G * τ enables to compute conveniently the gradient of the smoothed Wasserstein distance d Θ,τ .It corresponds to a convexity assumption on G τ .These aspects are further discussed in Section 6.2 as well.
Remark 4.4 (strong duality).The condition of strong duality made by Lemma 4.6 is crucial for what follows.This condition will be satisfied later on when working in a geometric setting with local measures M, µ 1 , µ 2 with full support, as introduced in Section 3.1.Lemma 4.7.Let the linear mapping A be defined by (2.3b).Then Proof.Appendix A.5.
The following Lemma characterizes the set of optimal dual solutions to problem (4.23).
Lemma 4.8.Let the function G * τ of the dual objective function (4.23) resp.(4.18) be continuously differentiable and strictly convex, and let p ∈ R 2n ++ .Then the set of optimal dual solutions has the form (4.26) Proof.Appendix A.6.
We next clarify the attainment of optimal dual solutions due to Lemma 4.8.
Lemma 4.9.Consider the orthogonal decomposition R 2n = N (A ) ⊕ R(A) into linear subspaces and denote the corresponding components of a vector ν ∈ R 2n by ν = ν N + ν R .Then, for p ∈ R 2n ++ satisfying p, 1n −1n = 0, we have that is a unique dual maximizer exists in the subspace R(A).
We are now in a position to prove Theorem 4.5.
Proof of Theorem 4.5.In the following, we proceed in three steps: First, we relate the orthogonal decomposition R 2n = N (A ) ⊕ R(A) to the tangent space T p (S × S) = T × T ⊂ R 2n for any p = (p 1 , p 2 ) ∈ S × S.
Second, the existence of a global isometric chart for the manifold S × S is shown in order to represent the smoothed Wasserstein distance d Θ,τ and the dual objective function g(µ, ν) in a convenient way.Third, we apply Theorem 2.2.
(1) Consider the unique decomposition At first, we show T ×T ⊆ R(A).For this, take an arbitrary v = ( v 1 v 2 ) ∈ T ×T .Due to the definition of T , we have 1 n , v 1 = 1 n , v 2 = 0 and thus v, 1n −1n = 0, which according to Lemma 4.7 means v ∈ N (A ) ⊥ = R(A).As a consequence of T × T ⊆ R(A) we have P T ×T (ν N ) = 0 and therefore Statement (4.28) follows from is orthonormal, the orthogonal projection reads P T ×T = BB .(4.32) (3) Using φ given by (4.31), we obtain the coordinate representations of the smoothed Wasserstein distance d Θ,τ and the dual objective function g(p, ν).Since we assume strong duality, that is equality of the optimal values of (4.22) and (4.23), we have d Θ,τ (p) = max ν∈R 2n g(p, ν).Setting x p = φ −1 (p), this equation translates in view of Lemma 4.9 to with unique maximizer ν R = P R(A) (ν).Let B δ ⊂ R(A) be a compact neighborhood of ν R .Then (4.34) remains valid after restricting R(A) to B δ .Because g given by (4.18) is linear in the first argument and the mapping φ is affine, the function g is convex in the first argument and differentiable, hence satisfies the assumptions of Theorem 2.2.In order to compute the gradient ∇ x g(x, ν R ), it suffices to consider the first term φ(x), ν R of g, which only depends on x.Using (4.31), we have Thus, ∇ x g(x, ν R ) = B ν R which continuously depends on ν R .As a consequence, we may apply Theorem 2.2 and obtain due to (2.10) Using the differential Dφ(x) = B, we finally get which proves (4.19).

APPLICATION TO GRAPHICAL MODELS
This section explains how the labeling approach on the assignment manifold of Section 3 can be applied to a graphical model, using the global and local gradients derived in Section 4. The graphical model is given in terms of an energy function E(x) of the form (1.2).The basic idea, worked out in Section 5.1, for determining a labeling x with low energy E(x) is to combine minimization of the convex relaxation (1.3) and non-convex rounding to an integral solution in a single smooth process.This idea is realized by restricting the smooth approximation (1.4) of the objective function to the assignment manifold due to Section 3.1, and by combining numerical integration of the corresponding Riemannian gradient flow due to Section 3.3 with the assignment mechanism suggested by [ ÅPSS17] due to Section 3.2.
Section 5.2 complements our preliminary observations stated as Remarks 4.1 and 4.2, in order to highlight the essential properties of this process as a novel way of 'belief propagation' using dually computing gradients of local Wasserstein distances, that we call Wasserstein messages.

Smooth Intergration of Minimizing and
Rounding on the Assignment Manifold.We recall how regularization is performed by the assignment approach of [ ÅPSS17]: distance vectors (3.16) representing the data term of classical variational approaches are lifted to the assignment manifold by (3.17) and geometrically averaged over spatial neighborhoods -eq.(3.19) resp.(3.22).
Given a graphical model in terms of an energy function (1.2), regularization is already defined by the pairwise model parameters E ij ( i , j ) resp.θ ij ( i , j ), so that evaluating the gradient of the regularized objective function (1.4) implies averaging over spatial neighborhoods, as eq.(4.13) clearly displays.Taking additionally into account the simplest (explicit Euler) update rule (3.31) for geometric integration of Riemannian gradient flows on the assignment manifold, a natural definition of the similarity matrix that consistently incorporates the graphical model into the geometric approach of [ ÅPSS17], is where h is a stepsize parameter and the partial gradients ∇ i E τ (W (k) ) are given by (4.13).The sequence (W (k) ) is initialized in an unbiased way at the barycenter W (0) ∈ W. Adopting the fixed point iteration proposed by [ ÅPSS17] leads to the update of the assignment matrix (5.2) These two interleaved update steps represent two objectives: (i) minimize the function E τ on the assignment manifold W (Section 3.3) and (ii) converge to an integral solution, i.e. a valid labeling.Plugging (5.1) into (5.2) gives

,
(5.3) which suggests to control more flexibly the latter rounding mechanism by a rounding parameter α and the update rule , α ≥ 0. (5.4) The following proposition reveals the continuous gradient flow that is approximated by the sequence (5.4).
Proposition 5.1.Let E τ be given by (1.4) and denote the entropy of the assignment matrix W by (5.5) Then the sequence of updates (5.4) are geometric Euler-steps for numerically integrating the Riemannian gradient flow of the extended objective function (5.6) Proof.An Euler-step for minimizing f τ,α on the tangent space reads (with where the i-th row of W (k) is given by W In order to compute the gradient of the entropy, consider a smooth curve γ : (−ε, ε) → W with γ(0) = W and γ(0) = X.Then (5.8) Since log(W ), X = P T m log(W ), X and 11 , X = 1, X1 = 1, 0 = 0, we have (5.9) Thus, using P T (log(W i )) = L −1 c (W i ) from (3.6), we obtain (5.10) Substitution into (5.7)gives and in turn the update (5.12a) (5.12c) which is (5.4).
Remark 5.1 (continuous DC programming).Proposition 5.1 and (5.6) admit to interpret the update rule (5.4) as a continuous difference of convex (DC) programming strategy.Unlike the established DC approach [PDHA97, PDHA98], however, which takes large steps by solving to optimality a sequence of convex programs in connection with updating an affine upper bound of the concave part of the objective function, our update rule (5.4) differs in two essential ways: geometric optimization by numerically integrating the Riemannian gradient flow tightly interleaves with rounding to an integral solution.
5.2.Wasserstein Messages.We get back to the informal discussion of belief propagation in Section 1.2 in order to highlight properties of our approach (1.4) from this viewpoint.We first sketch belief propagation and the origin of corresponding messages, and refer to [YFW05,WJ08] for background and more details.
Starting point is the primal linear program (LP) (1.3) written in the form where the constraints represent the feasible set L G which is explicitly given by the local marginalization constraints (2.15).The corresponding dual LP reads with dual (multiplier) variables corresponding to the affine primal constraints.In order to obtain a condition that relates optimal vectors µ and ν without subdifferentials that are caused by the non-smoothness of these LPs, one considers the smoothed primal convex problem with smoothing parameter ε > 0, degree d(i) of vertex i, and with the local entropy functions (5.17) Setting temporarily ε = 1 and evaluating the optimality condition ∇ µ L(u, v) = 0 based on the corresponding Lagrangian L(u, v) = θ, µ − H(µ) + ν, Aν − b , (5.18) yields the relations connecting µ and ν, k∈N (j)\{i} e ν jk (x j ) , (5.19b) x i , x j ∈ X , ij ∈ E, where the terms e ν i , e ν i +ν j normalize the expressions on the right-hand side whereas the so-called messages e ν ij (x i ) enforce the local marginalization constraints µ ij ∈ Π(µ i , µ j ).Invoking these latter constraints enables to eliminate the left-hand side of (5.19) to obtain after some algebra the fixed point equations solely in terms of the dual variables, commonly called sum-product algorithm or loopy belief propagation by message passing.Repeating this derivation, after weighting the entropy function H(µ) of (5.18) by ε as in (5.16), and taking the limit lim ε 0 , yields relation (5.20) with the sum replaced by the max operation, as a consequence of taking the log of both sides and relation (2.8).This fixed point iteration is called max-product algorithm in the literature.
In order to highlight subsequently major differences to our approach, we make the following key observations: (1) Local non-convexity.The negative −H(µ) of the so-called Bethe entropy function H(µ) is nonconvex in general for graphs G with cycles [WJ08, Section 4.1], due to the negative sign of the second sum of (5.16).(2) Local rounding at each step.The max-product algorithm performs local rounding at every step of the iteration so as to obtain integral solutions, i.e. a labeling after convergence.This operation results as limit of a non-convex function, due to (1).(3) Either nonsmoothness or strong nonlinearity.The latter max-operation is inherently nonsmooth.
Preferring instead a smooth approximation with 0 < ε 1 necessitates to choose ε very small so as to ensure rounding.This, however, leads to strongly nonlinear functions of the form (2.8) that are difficult to handle numerically.(4) Invalid constraints.Local marginalization constraints are only satisfied after convergence of the iteration.Intuitively it is plausible that, by only gradually enforcing constraints in this way, the iterative process becomes more susceptible to getting stuck in unfavourable stationary points, due to the non-convexity according to (1).Our geometric approach removes each of these issues.Message passing with respect to vertex i ∈ V is defined by evaluating the local Wasserstein gradients of (4.13) for all edges incident to i.We therefore call Wasserstein messages these local gradients that are 'passed along edges'.Similarly to (5.20), each such message is given by dual variables through (4.19), that solve the regularized local dual LPs (4.18).As a consequence, local marginalization constraints are always satisfied, throughout the iterative process.
In addition, we make the following observations in correspondence to the points (1)-(4) above: (1) Local convexity.Wasserstein messages of (4.13) are defined by local convex programs (4.18).This contrasts with loopy belief propagation and holds true for any pairwise model parameters θ ij of the prior of the graphical model and the corresponding coupling of µ i and µ j .This removes spurious minima introduced through non-convex entropy approximations.(2) Smooth global rounding after convergence.Rounding to integral solutions is gradually enforced through the Riemannian flow induced by the extended objective function (5.6).In particular, repeated 'aggressive' local max operations of the max-product algorithm are replaced by a smooth flow.(3) Smoothness and weak nonlinearity.The role of the smoothing parameter τ of (1.4) differs from the role of the smoothing parameter ε of (5.16).While the latter has to be chosen quite close to 0 so as to achieve rounding at all, τ merely mollifies the dual local problems (4.18) and hence should be chosen small, but may be considerably larger than ε.In particular, this does not impair rounding due to (2), which happens due to the global flow which is smoothly driven by the Wasserstein messages.This decoupling of smoothing and rounding enables to numerically compute labelings more efficiently.The results reported in Section 7 demonstrate this fact.(4) Valid constraints.By construction, computation of the Wasserstein messages enforces all local marginalization constraints throughout the iteration.This is in sharp contrast to belief propagation where this generally holds after convergence only.Intuitively, it is plausible that our more tightly constrained iterative process is less susceptible to getting stuck in poor local minima.The results reported in Section 7.2 provide evidence of this conjecture.

IMPLEMENTATION
In this section we discuss several aspects of the implementation of our approach.The numerical update scheme used in our implementation is given by (5.4), where α ≥ 0 is the rounding parameter, h > 0 the step-size and τ the smoothing parameter for the local Wasserstein distances.Section 6.1 details a strategy for maintaining in a numerically stable way strict positivity of all variables defined on the assignment manifold.Numerical aspects of computing local Wasserstein gradients are discussed in Section 6.2, and the natural role of the entropy function is highlighted for assuming the role of the smoothing function F τ in eq.(4.3).Our criterion for convergence and terminating the iterative process (6.1) of label assignment is specified in Section 6.3.6.1.Assignment Normalization.The rounding mechanism addressed by Prop.5.1 and Remark 5.1 will be effective if α h in (5.6) is chosen large enough so as to compensate the influence of the function F τ that regularizes the local Wasserstein distances (4.3).
In this case, each vector W i approaches some vertex e i of the simplex and thus some entries of W i converge to zero.However, due to our optimization scheme every vector W i evolves on the interior of the simplex S, that is all entries of W i have to be positive all the time -see also Remark 4.4.Since there is a difference between mathematical and numerical positivity, we avoid numerical problems by adopting the normalization strategy of [ ÅPSS17].After each iteration, we check all W i and whenever an entry drops below ε = 10 −10 , we rectify W i by Thus, the constant ε plays the role of 0 in our implementation.Our numerical experiments showed that this operation removed any numerical issues without affecting convergence.
6.2.Computing Wasserstein Gradients.A core subroutine of our approach concerns the computation of the local Wasserstein gradients as part of the overall gradient (4.13).We argue in this section why the negative entropy function that we use in our implementation for smoothing the local Wasserstein distances, plays a distinguished role.To this end, we adopt again in this section the notation (4.4).Using this notation the smooth entropy regularized Wasserstein distance (4.3) reads with the entropy function As shown in Section 4.3 and according to Theorem 4.5, the gradients of (6.3) are the maximizer of the corresponding dual problem.Using the notation (4.4), the dual problem of (6.3) reads In particular, in view of the general form (4.9) of this dual problem, the indicator function (4.11) is smoothly approximated by the function τ exp( 1 τ x). Figure 6.1 compares this approximation with the classical logarithmic barrier − log(−x) function for approximating the indicator function δ R − of the nonpositive orthant.Log-barrier penalty functions are the method of choice for interior point methods [NN87, Ter96], which strictly rule out violations of the constraints.While this is essential for many applications where constraints represent physical properties that cannot be violated, it is not essential in the present case for calculating the Wasserstein messages.Moreover, the bias towards interior points by log-barrier functions, as Figure 6.1 clearly shows, is detrimental in the present context and favours the formulation (6.5).We now make explicit how the local Wasserstein gradients (4.19) are computed based on the formulation (6.3) and examine numerical aspects depending on the smoothing parameter τ .It is well known that doubly stochastic matrices as solutions of convex programs like (6.3) can be computed by iterative matrix scaling [Sin64,Sch90], [Bru06,ch. 9].This has been made popular in the field of machine learning by [Cut13b].
The optimality condition (4.24) takes the form and rearranging makes explicit the connection to matrix scaling: where Diag(•) denotes the diagonal matrix with the argument vector as entries.For given marginals µ = (µ 1 , µ 2 ) due to (6.3) and with the shorthand K = exp − 1 τ Θ , the optimal dual variables ν = (ν 1 , ν 2 ) can be determined by the Sinkhorn's iterative algorithm [Sin64], up to a common multiplicative constant.Specifically, we have Lemma 6.1 ([Cut13b, Lemma 2]).For τ > 0, the solution M of (6.3) is unique and has the form M = diag(v 1 )Kdiag(v 2 ), where the two vectors v 1 , v 2 ∈ R n are uniquely defined up to a multiplicative factor.

Accordingly, by setting
the corresponding fixed point iterations read which are iterated until the change between consecutive iterates is small enough.Denoting the iterates after convergence by v 1 , v 2 , resubstitution into (6.8)determines the optimal dual variables Due to Theorem 4.5, the local Wasserstein gradients then finally are given by where the projection P T due to (3.2) removes the common multiplicative constant resulting from Sinkhorn's algorithm.
While the linear convergence rate of Sinkhorn's algorithm is known theoretically [Kni08], the numbers of iterations required in practice significantly depends on the smoothing parameter τ .In addition, for smaller values of τ , an entry of the matrix K = exp − 1 τ Θ might be too small to be represented on a computer, due to machine precision.As a consequence, the matrix K might have entries which are numerically treated as zeros and Sinkhorn's algorithm does not necessarily converge to the true optimal solution.
Fortunately, our approach does not encounter such problems because merely a sufficiently accurate approximation of the gradient of the Wasserstein distance is required, rather than an approximation of the Wasserstein distance itself, to obtain valid descent directions for the energy function to be minimized.Figures 6.2 and 6.3 demonstrate that this indeed holds for relatively large values of τ , e.g.τ ∈ { 1 5 , 1 10 , 1 15 }, no matter if the number of labels is n = 10 or n = 1000.n 1 to the vertex e 1 on the simplex ∆ n .The cost matrix Θ is given by the Potts regularizer.In all three plots the parameter τ has been chosen as τ = 1 5 (cyan), τ = 1 10 (green), τ = 1 20 (blue), τ = 1 50 (red) and τ = 1 100 (black).Even though the values of the approximation of the distance itself differ considerably, the slope of the distance that really matters, is approximated already for larger values of τ quite well and uniformly for small up to large numbers n of labels.6.3.Termination Criterion.In all experiments, the normalized averaged entropy W ik log W ik (6.12) was used as a termination criterion, i.e. if the value drops below a certain threshold the algorithm is terminated.Due to this normalization, the value does not depend on the number of labels and thus the threshold is comparable across different models with a varying number of pixels and labels.For example, a threshold of 10 −4 means in practice that, up to a small fraction of nodes i ∈ V, all rows W i of the assignment matrix W are very close to unit vectors and thus indicate an almost unique assignment of the prototypes or labels to the observed data.

EXPERIMENTS
We demonstrate in this section main properties of our approach.The dependency of label assignment on the smoothing parameter τ and the rounding parameter α is illustrated in Section 7.1.We comprehensively explored the space of binary graphical models defined on the minimal cyclic graph, the complete graph with three vertices K 3 , whose LP-relaxation is known to have a substantial part of nonbinary vertices.The results reported in Section 7.2 exhibit a relationship between α and τ so that in fact a single effective parameter only controls the trade off between accuracy of optimization and the computational costs.A competitive evaluation of our approach in Section 7.3 together with two established and widely applied approaches, sequential tree-reweighted message passing (TRWS) [Kol06] and loopy belief propagation reveals similar 10 and 1 20 , the gradient of the Wasserstein distance is sufficiently accurate approximated so as to obtain valid descent directions for distance minimization.performance of our approach.Finally, Section 7.4 demonstrates for a graphical model with pronounced nonuniform pairwise model parameters (non-Potts prior) that our geometric approach accurately takes them into account.
All experiments have been selected to illustrate properties of our approach, rather than to demonstrate and work out a particular application which will be the subject of follow-up work.7.1.Parameter Influence.We assessed the parameter influence of our geometric approach by applying it to a labeling problem.The task was to label a noisy RGB-image f : V → [0, 1] 3 , depicted in Fig. 7.2, on the grid graph G = (V, E) with minimal neighborhood size |N (i)| = 3 × 3, i ∈ V. Prototypical colors P = {l 1 , . . ., l 8 } ⊂ [0, 1] 3 (Fig. 7.2) were used as labels.The unary (or data term) was defined using the • 1 distance and a scaling factor ρ > 0 by 2) The feature scaling factor was set to ρ = 0.3, the step-size h = 0.1 was used for numerically integrating the Riemannian descent flow, and the threshold for the normalized average entropy termination criterion (6.12) was set to 10 −4 .Figure 7.1 displays the empirical convergence rate depending on the rounding parameter α, for a fixed value of the smoothing parameter τ = 0.1 that ensures a sufficiently accurate approximation of the Wasserstein distance gradients and hence of the Riemannian descent flow.Less agressive rounding in terms of smaller values of α leads to a more accurate numerical integration of the flow using a larger number of iterations, and thus to higher quality label assignments with a lower energy of the objective function.This latter aspect is demonstrated quantitatively in Section 7.2.Overall, the total number of iterations is significantly smaller than the number of iterations which first-order methods of convex programming require for solving the LP relaxation [LS11].For too small values of the rounding parameter α, the algorithm does naturally not converge to an integral solution.Fig. 7.2 shows the influence of the rounding strength α and the smoothing parameter τ for the Wasserstein distance.All images marked with an ' * ' in the lower right corner do not show an integral solution, which means that the normalized average entropy (6.12) of the assignment vectors W i did not drop below the threshold during the iteration and thus, even though the assignments show a clear tendency, they stayed far from integral solutions.As just explained for Fig. 7.1, this is not a deficiency of our approach but must happen if either no rounding is performed (α = 0) or if the influence of rounding is too small compared to the smoothing of the Wasserstein distance (e.g.α = 0.1 and τ = 0.5).Increasing the strength of rounding (larger α) leads to a faster decrease in entropy (cf.Fig. 7.1 for the case of τ = 0.1) and therefore to an earlier convergence of the process to a specific labeling.Thus, a more aggressive rounding scheme yields a less regularized result due to the rapid decision for a labeling at an early stage of the algorithm.On the other hand, choosing the smoothing parameter τ too large lead to poor approximations of the Wasserstein distance gradients and consequently to erroneous non-regularized labelings, as displayed in the left-most column of Fig. 7.1.Once τ is small enough, in our experiments: τ < 0.1, the Wasserstein distance gradients are properly approximated, and the label assignment is regularized as expected and can be conveniently controlled by α.In particular, this upper bound on τ is sufficiently large to ensure very rapid convergence of the fixed point iteration for computing the Wasserstein distance gradients.
7.2.Exploring all Cyclic Graphical Models on K 3 .In this section, we report an exhaustive exploration of all possible binary models, X = {0, 1}, on the minimal cyclic graph K 3 (Fig. 7.3, left panel).Due to the single cycle, models exist where the LP relaxation (1.3) returns a non-binary solution (red part of the right panel of Fig. 7.3).As a consequence, evaluating such models with our geometric approach for minimizing (1.4) enables to check two properties: (i) Whenever solving the LP relaxation (1.3) by convex programming returns the global binary minimum of (1.2) as solution, we assess if our geometric approach based on the smooth approximation (1.4) returns this solution as well.(ii) Whenever the LP relaxation has a non-binary vector as global solution, which therefore is not optimal for the labeling problem (1.2), we assess the rounding property of our approach by comparing the result with the correct binary labeling globally minimizing (1.2).The graph K 3 enables to specify the so-called marginal polytope P K 3 , whose vertices (extreme points) are the feasible binary combinatorial solutions that correspond to valid labelings (cf.Section 1.1), and to examine the difference to the local polytope L K 3 whose representation only involves a subset of the constraints corresponding to P K 3 .We refer to [Pad89] for background and details.
The constraints are more conveniently stated using the so-called minimal representation of binary graphical models [WJ08, Sect.3.2], that involves the variables1 and encodes the local vectors (2.15) by (7.4) FIGURE 7.4.EVALUATION OF THE MINIMAL GRAPHICAL MODEL K 3 : For every pair of parameters (τ, α), we evaluated 10 5 models, which were generated as explained in the text.
In all experiments, we terminate the algorithm if the average entropy dropped below 10 −3 or if the maximum number of 600 iterations was reached.In addition we chose a constant step-size h = 0.5.LEFT: The plot shows the percentage of experiments where the energy returned by our algorithm had a relative error smaller then 1% compared to the minimal energy of the globally optimal integral labeling.RIGHT: This plot shows the corresponding average number of iterations.The black color indicates that the maximum number of 600 iterations was reached, because too strong smoothing of the Wasserstein distance (large τ ) overcompensates the effect of rounding (small α), so that the convergence criterion (6.12) which measures the distance to integral solutions, cannot be satisfied.In the remaining large parameter regime, the choice of α enables to control the trade-off between high-quality (low-energy) solutions and computational costs.
Thus, it suffices to use a single variable µ i for every node i ∈ V instead of two variables µ i (0), µ i (1), and also a single variable µ ij for every edge ij ∈ E instead of four variables µ ij (0, 0), µ ij (0, 1), µ ij (1, 0), µ ij (1, 1).The local polytope constraints (2.15) then take the form The marginal polytope constraints additionally involve the so-called triangle inequalities [DL97] i∈V , which results in a larger influence of the pairwise terms and hence make inference more difficult.Suppose, for example, that the diagonal terms of FIGURE 7.5.These plots show the same results as Fig. 7.4 together with additional data boxes and information for three different configurations of parameter values.Comparing the success rate (left panel) and the number of iterations until convergence (right panel) clearly demonstrate the trade-off between the accuracy of optimization and the convergence rate, depending on the rounding variable α and the smoothing parameter τ .Overall, the number of iterations is significantly smaller than for first-order methods of convex programming for solving the LP relaxation, that additionally require rounding as a post-processing step to obtain an integral solution.θ ij are large, which favours the assignment of different labels to the nodes 1, 2, 3 ∈ V. Then assigning say the labels 0 and 1 to the vertices 1 and 2, respectively, will inherently lead to a large energy contribution due to the assignment to node 3, no matter if this third label is 0 or 1, because it must agree with the assignment either to node 1 or to2.
Every binary vertex listed by Fig. 7.3, right panel, is the global optimum of both the linear relaxation (1.3) and the original objective function (1.2) in approximately ≈ 11.94% of the 10 5 scenarios, whereas every non-binary vertex is optimal in approximately ≈ 1.12%.Fig. 7.4 presents the results of the experiments for the minimal cyclic graphical model K 3 .To get an intuition of our rounding parameter α and the smoothing parameter τ , we evaluated all 10 5 models for each pair of (τ, α), where τ ∈ { 1 2 , 1 2.5 , . . ., 1 6.5 , 1 7 } and α ∈ {0.1, 0.11, . . ., 0.99, 1}.Fig. 7.5 presents exactly the same results as Fig. 7.4, except that we include data boxes for three different configurations of parameter.E.g., for α = 0.22 and τ = 1 5 our algorithm found in 97.35% of the experiments an energy with relative error smaller then 1% compared to the optimal energy.At the same time, the algorithm required 45 iterations on average to converge.As shown by the following parameter, α = 0.58 and τ = 0.15, a larger value of α means that the rounding mechanism in each iteration step (5.4) is increased.The average number of iterations is reduced to 9, but the accuracy is also dropped to 88.6%.
Overall, these experiments clearly demonstrate • the ability to control the trade-off between high-quality (low energy) labelings and computational costs in terms of α, for all values of τ below a reasonably large upper bound, • a small or very small number of iterations required to converge, depending on the choice of α.

Original data
Noisy data FIGURE 7.6.Noisy image labeling problem: a binary ground truth image (left) to be recovered from noisy input data (right).
For this comparison, we evaluate the performance of the methods for a noisy binary labeling scenario depicted by Fig. 7.6.Let f : V → [0, 1] denote the noisy image data given on the grid graph G = (V, E) with a 4-neighborhood and X = {0, 1} be the prototypes.Then the used data term and Potts prior are given by for i ∈ V and θ ij = 0 1 1 0 for ij ∈ E .(7.8) In the experiments, the threshold for the normalized average entropy termination criterion (6.12) was 10 −4 .Figure 7.7 shows the visual reconstruction as well as the corresponding energy values and percentage of correct labels for all three methods.Our method has the same accuracy and returns the same optimal energy level as TRWS and Loopy-BP.
To investigate once more the influence of the rounding mechanism, we repeated the same experiment as explained above, but for different values of the rounding parameter α ∈ {0.1, 1, 2, 5}.As shown by Fig. 7.8, the results confirm the finding of the experiments of the preceding section: A more aggressive rounding scheme (α large) leads to faster convergence but yields less regularized results with higher energy values.Assume additional information about a labeling problem is available.For example, let the RGB-color dark blue in the image represent the direction "top", light blue "bottom", yellow "right", orange "left" and cyan "center" (Fig. 7.9 left).Suppose it is known beforehand that "top" and "bottom" as well as "left" and "right" cannot be adjacent to each other but are separated by another label corresponding to the center.This prior knowledge about the labeling problem was taken into account by specifying non-uniform pairwise model parameters accordingly, as specified below.
l 1 l 2 l 3 l 4 l 5 original image noisy image labels FIGURE 7.9.Original image (left), encoding the image directions "top", "bottom", "center", "left" and "right" by the RGB-color labels l 1 , l 2 , l 3 , l 4 and l 5 (right).The noisy test image (middle) was created by randomly selecting 40% of the original image pixels and then uniformly sampling a label at each chosen positions.The label constraints suggested by the labeling order leads to a non-Potts prior (7.11).
To demonstrate that these model parameters influence label assignments accordingly, we compared the evaluation of this model against one with a uniform Potts prior (7.12) For our experiments, we used the scaling factor ρ = 15 for the unaries, step-size h = 0.1, rounding parameter α = 0.01, smoothing parameter τ = 0.01 and 10 −4 as threshold for the normalized average entropy termination criterion (6.12).
Potts non-Potts Acc : 87.12% Acc : 99.34%The results depicted in Fig. 7.10, clearly show the influence of the non-Potts prior (labeling accuracy 99.34%) whereas using the Potts prior lowers accuracy to only 87.12%.This is due to the fact that the color labels 4 and 5 as well as 1 and 2 have a relatively small • 1 distance and are therefore not easy to distinguish using a Potts model.The additional prior information about valid label configurations encoded by (7.11) suffices to overcome this difficulty and to separate the regions correctly.

FIGURE 6
FIGURE 6.1.Approximations of the indicator function δ R − of the nonpositive orthant.The log-barrier function (black curves) strictly rule out violations of the constraints but induce a bias towards interior points.Our formulation (blue curves) is less biased and reasonable approximates the δ-function (red curve) depending on the smoothing parameter τ .

FIGURE 6
FIGURE 6.2.The plots show the entropy-regularized Wasserstein distance d Θ,τ (c, γ(t)) for varying parameter τ and increasing number of labels n.Here, γ(t) = t(e 1 − c) + c ∈ ∆ n with t ∈ [0, 1] is the line segment from the barycenter c = 1n 1 to the vertex e 1 on the simplex ∆ n .The cost matrix Θ is given by the Potts regularizer.In all three plots the parameter τ has been chosen as τ = 1 5 (cyan), τ = 1 10 (green), τ = 1 20 (blue), τ = 1 50 (red) and τ = 1 100 (black).Even though the values of the approximation of the distance itself differ considerably, the slope of the distance that really matters, is approximated already for larger values of τ quite well and uniformly for small up to large numbers n of labels.

FIGURE 6
FIGURE 6.3.The plot shows the entropy-regularized Wasserstein distance with the Potts regularizer from the barycenter to every point on ∆ 3 for different values of τ .Used values are (a) τ = 1 5 , (b) τ = 1 10 , (c) τ = 1 20 and (d) τ = 1 50 .These plots confirm that even for relatively large values of τ , e.g. 10 and 1 20 , the gradient of the Wasserstein distance is sufficiently accurate approximated so as to obtain valid descent directions for distance minimization.

FIGURE 7
FIGURE 7.1.The normalized average entropy (6.12) as a function of iterations for the smoothing parameter value τ = 0.1.With increasing values for α the entropy drops more rapidly and hence converges faster to an integral labeling.

FIGURE 7
FIGURE 7.2.Influence of the rounding parameter α and the smoothing parameter τ on the assignment of 8 prototypical labels to noisy input data.All images marked with an ' * ' do not show integral solutions because too large smoothing of the Wasserstein distance in terms of τ , relative to α, overcompensated the effect of rounding.Likewise, a too strong smoothing of the Wasserstein distance (left column, τ = 0.5) yields poor approximations of the objective function gradient and to erroneous label assignments.For the remaining parameter regime, i.e. smoothing below a still reasonably large upper bound τ = 0.1 that leads to fast numerical convergence, the label assignment can be precisely controlled by α.
Figure 7.3, right panel, lists the 8 vertices of P K 3 and the 4 additional vertices of L K 3 that arise when dropping the subset of constraints (7.6).We evaluated 10 5 models generated by randomly sampling the model parameters (2.11):With U[a, b] denoting the uniform distribution on the interval [a, b] ⊂ R, we set FIGURE 7.7.Results for the noisy labeling problem from Fig. 7.6 using a standard data term with Potts prior, with Energy / Accuracy values.Parameter values for the geometric approach: smoothing τ = 0.1, step-size h = 0.2 and rounding strength α = 0.1.The threshold for the termination criterion was 10 −4 .All methods show equal performance.

FIGURE 7 .
FIGURE 7.10.Results of the labeling problem using the Potts and non-Potts prior model together with the Accuracy (Acc) values.Parameters for this experiment are ρ = 15, smoothing τ = 0.01, step-size h = 0.1 and rounding strength α = 0.01.The threshold for the termination criterion (6.12) was 10 −4 .