On the Brittleness of Bayesian Inference

With the advent of high-performance computing, Bayesian methods are increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is a pressing question for which there currently exist positive and negative results. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they could be generically brittle when applied to continuous systems (and their discretizations) with finite information on the data-generating distribution. If closeness is defined in terms of the total variation metric or the matching of a finite system of generalized moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions. The mechanism causing brittleness/robustness suggests that learning and robustness are antagonistic requirements and raises the question of a missing stability condition for using Bayesian inference in a continuous world under finite information.

The application of Bayes' theorem in the form of Bayesian inference has fueled an ongoing debate with practical consequences in science, industry, medicine, and law [21]. One commonly cited justification for the application of Bayesian reasoning is Cox's theorem [15], which has been interpreted as stating that any "natural" extension of Aristotelian logic to uncertain contexts must be Bayesian [34]. It has now been shown that Cox's theorem as originally formulated is incomplete [28], and there is some debate about the "naturality" of the additional assumptions required for its validity [1,20,29,31], e.g., the assumption that knowledge can always be represented in the form of a σ-additive probability measure that assigns to each measurable event a single real-valued probability.
However, and this is the topic of this article, regardless of the internal logic, elegance, and appealing simplicity of Bayesian reasoning, a critical question is that of the robustness of its posterior conclusions with respect to perturbations of the underlying models and priors.
For example, a frequentist statistician might ask, if the data happen to be a sequence of i.i.d. draws from a fixed data-generating distribution μ † , whether or not the Bayesian posterior will asymptotically assign full mass to a parameter value that corresponds to μ † . When it holds, this property is known as frequentist consistency of the Bayes procedure, or the Bernstein-von Mises property. Alternatively, without resorting to a frequentist data-generating distribution μ † , a Bayesian statistician who is also a numerical analyst might ask questions about stability and conditioning: does the posterior distribution (or the posterior value of a particular quantity of interest) change only slightly when elements of the problem setup (namely, the prior distribution, the likelihood model, and the observed data) are perturbed, e.g., as a result of observational error, numerical discretization, or algorithmic implementation? When it holds, this property is known as robustness of the Bayes procedure.
This paper summarizes recent results [46,47] that give conditions under which Bayesian inference appears to be nonrobust in the most extreme fashion, in the sense that arbitrarily small changes of the prior and model class lead to arbitrarily large changes of the posterior value of a quantity of interest. We call this extreme nonrobustness "brittleness," and it can be visualized as the smooth dependence of the value of the quantity of interest on the prior breaking into a fine patchwork, in which nearby priors are associated to diametrically opposed posterior values. Naturally, the notion of "nearby" plays an important role, and this point will be revisited later.
Much as classical numerical analysis shows that there are "stable" and "unstable" ways to discretize a partial differential equation (PDE), these results and the wider literature of positive [8,13,19,37,38,53,56] and negative [3,17,23,24,35,40] results on Bayesian inference contribute to an emerging understanding of "stable" and "unstable" ways to apply Bayes' rule in practice.
The results reported in this article show that the process of Bayesian conditioning on data at fine enough resolution is unstable (or "sensitive" as defined in [54]) with respect to the underlying distributions (under the total variation (TV) and Prokhorov metrics) and is the source of negative results similar to those caused by tail properties in statistics [2,18]. The mechanisms causing the stability/instability of posterior predictions suggest that learning and robustness are conflicting requirements and raise the possibility of a missing stability condition when using Bayesian inference for continuous systems with finite information (akin to the Courant-Friedrichs-Lewy (CFL) stability condition when using discrete schemes to approximate continuous PDEs).

Bayes' Theorem and Robustness.
To begin, let us consider a simple example of Bayesian reasoning in action.
Problem 1. Consider a bag containing 102 coins, one of which always lands on heads, while the other 101 are perfectly fair. One coin is picked uniformly at random from the bag, flipped 10 times, and 10 heads are obtained. What is the probability that this coin is the unfair coin?
The correct probability is given by applying Bayes' theorem:
$$\mathbb{P}[A \mid B] = \frac{\mathbb{P}[B \mid A]\,\mathbb{P}[A]}{\mathbb{P}[B]}, \tag{1}$$
where A is the event "the coin is the unfair coin" and B is the event "10 heads are observed." Here $\mathbb{P}[A] = 1/102$, $\mathbb{P}[B \mid A] = 1$, and $\mathbb{P}[B] = 1/102 + (101/102) \times 2^{-10}$, so that $\mathbb{P}[A \mid B] = 1/(1 + 101 \times 2^{-10}) \approx 0.91$. If the number of coins is not known exactly and the supposedly fair coins are not exactly fair, then Bayes' theorem produces a robust inference in the following sense: if the fair coins are slightly unbalanced and the probability of getting a tail is 0.51, and if an estimate of 100 coins is used and an estimate $1/2$ of the fairness of the fair coins is used, then the resulting estimate $1/(1 + 99 \times 2^{-10})$ is still a good approximation to the correct answer. Observe also that if the prior estimate of the number of coins in the bag is grossly wrong (e.g., $10^6$), then the posterior would still be accurate in the limit of infinitely many coin flips: in this case, the Bayesian estimator is said to be consistent.
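The arithmetic behind Problem 1 can be checked directly. The following is a minimal sketch using exact rational arithmetic; the helper name `posterior_unfair` and its parameters are illustrative choices, not notation from the article.

```python
from fractions import Fraction

def posterior_unfair(n_other, p_heads_other, n_flips=10):
    """P[unfair coin | n_flips heads] for a bag holding one always-heads coin
    and n_other coins that each land heads with probability p_heads_other."""
    prior_unfair = Fraction(1, n_other + 1)
    like_other = Fraction(p_heads_other) ** n_flips
    evidence = prior_unfair + (1 - prior_unfair) * like_other
    return prior_unfair / evidence

# Exact setting of Problem 1: 101 perfectly fair coins.
exact = posterior_unfair(101, Fraction(1, 2))
# Practitioner's estimate: 100 coins in the bag, fairness 1/2.
estimate = posterior_unfair(99, Fraction(1, 2))
# A slightly different "truth": the 101 coins land tails with probability 0.51.
perturbed_truth = posterior_unfair(101, Fraction(49, 100))
# Grossly wrong prior (10**6 coins) but many more flips: still close to 1.
many_flips = posterior_unfair(10**6 - 1, Fraction(1, 2), n_flips=60)

print(float(exact), float(estimate), float(perturbed_truth), float(many_flips))
```

In this finite setting, small errors in the number of coins or in the fairness of the coins move the posterior only slightly, which is the robustness that the continuous case calls into question.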
Do these conclusions remain true when the underlying probability space is continuous or an approximation thereof? For example, what if the random outcomes are decimal numbers-perhaps given to finite precision-rather than heads or tails?
The General Problem and Its Bayesian Answer.
Problem 2. Let $\mathcal{X}$ denote the space in which observations/samples take their values, and let $\mathcal{M}(\mathcal{X})$ denote the set of probability measures on $\mathcal{X}$. Let $\Phi \colon \mathcal{M}(\mathcal{X}) \to \mathbb{R}$ be a function (see footnote 1) defining a quantity of interest. Let the data-generating distribution $\mu^\dagger \in \mathcal{M}(\mathcal{X})$ be an unknown or partially known probability measure on $\mathcal{X}$. The objective is to estimate $\Phi(\mu^\dagger)$ from the observation of n i.i.d. samples from $\mu^\dagger$, which we denote by $d = (d_1, \dots, d_n) \in \mathcal{X}^n$.
Example 1. When $\mathcal{X}$ is the real line $\mathbb{R}$, a prototypical example of a quantity of interest is $\Phi(\mu) := \mu[X \geq a]$, the probability that the random variable X distributed according to μ exceeds the threshold value a. However, the results that we report below apply to any prespecified quantity of interest Φ.
The Bayesian answer to this problem is to model $\mu^\dagger$'s generation of sample data as coming from a random measure on $\mathcal{X}$ and to condition Φ with respect to the observation of the n i.i.d. samples. This is done by choosing a model class $\mathcal{A} \subseteq \mathcal{M}(\mathcal{X})$ and a probability measure $\pi \in \mathcal{M}(\mathcal{A})$, which we call the prior. This prior determines the randomness with which a representative $\mu \in \mathcal{A}$ is selected, and, for each such $\mu \in \mathcal{A}$, the generation of n i.i.d. samples $d \in \mathcal{X}^n$ by randomly sampling from $\mu^n$ naturally determines a product measure on $\mathcal{A} \times \mathcal{X}^n$. The prior estimate of the quantity of interest is $\mathbb{E}_{\mu \sim \pi}[\Phi(\mu)]$ and, for an open (see footnote 2) $B \subseteq \mathcal{X}^n$, the posterior estimate is defined as the conditional expectation $\mathbb{E}_{\mu \sim \pi,\, d \sim \mu^n}[\Phi(\mu) \mid d \in B]$ with respect to this product measure.
The connection to the standard presentation of Bayesian inference in terms of a prior on a parameter space is as follows: to construct a model class A ⊆ M(X ) and a prior π 0 ∈ M(A) from a Bayesian parametric model P : Θ → M(X ) defined on a parameter space Θ equipped with a prior p 0 ∈ M(Θ), one simply pushes forward under the map P. That is, the model class A ⊆ M(X ) is defined by A := P(Θ) and the prior π 0 ∈ M(A) is defined as the push-forward π 0 := Pp 0 of p 0 by the model P, i.e., π 0 (E) := p 0 (P −1 (E)) for measurable E ⊆ A.
Inconsistency under Misspecification. We now discuss the effects of misspecification on a Bayesian parametric model P : Θ → M(X ). It is convenient to denote such a model by P : θ → μ(θ), so that the model class is A := P(Θ) = {μ(θ) | θ ∈ Θ}.
Footnote 1. All spaces will be topological spaces, the term "function" will mean Borel measurable function, and "measure" will mean Borel measure.
Footnote 2. We assume B to be open and of strictly positive measure to avoid problems associated with conditioning with respect to events of measure zero.
If the model class P(Θ) contains the data-generating distribution μ † , i.e., if there is some parameter value θ ∈ Θ such that μ † = μ(θ), then the model is said to be well-specified ; otherwise, it is said to be misspecified.
For simplicity, consider the classical case where, for each θ ∈ Θ, μ(θ) has a probability density function with respect to some common reference measure on X , that is, μ(θ) = p( · , θ) dx for some measure dx. Then, for a prior p 0 ∈ M(Θ), let p n ∈ M(Θ) denote the posterior distribution on Θ after observing the data d (see, e.g., [5, p. 126]) and push forward both the prior and posterior to their corresponding measures, π 0 := Pp 0 and π n := Pp n , on M(A).
Now suppose that the model is well-specified and that $p_0$ gives strictly positive mass to every neighborhood of every point θ ∈ Θ (this assumption of "maximal open-mindedness" is commonly referred to as Cromwell's rule [41]). Then, when Θ is finite-dimensional, under suitable regularity conditions, the posterior value of the quantity of interest $\mathbb{E}_{\mu \sim \pi_n}[\Phi(\mu)]$ converges to $\Phi(\mu^\dagger)$ as n → ∞. This convergence, which can be shown to be asymptotically normal, is commonly referred to as the Bernstein-von Mises theorem or Bayesian central limit theorem [8,19,38,56]. However, for infinite-dimensional Θ and with similar regularity and strict positivity assumptions, there is a wealth of positive [13,37,53] and negative [3,17,23,24,35,40] results showing that the truth or otherwise of the Bernstein-von Mises property depends sensitively on subtle topological and geometrical details.
Example 2. To illustrate this, let $\mathcal{X} = \mathbb{R}$ and consider the Gaussian model $\mu(c, \sigma)$ with mean c and standard deviation σ, that is, with the probability density
$$p(x, c, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - c)^2}{2\sigma^2} \right)$$
and the expected log-likelihood
$$\mathbb{E}_{X \sim \mu^\dagger}\big[\log p(X, c, \sigma)\big] = -\log\big(\sigma\sqrt{2\pi}\big) - \frac{\mathbb{E}_{X \sim \mu^\dagger}\big[(X - c)^2\big]}{2\sigma^2}.$$
If, for a data-generating distribution $\mu^\dagger$ with finite second moments, we let $c^\dagger$ denote its mean and $\sigma^\dagger$ its standard deviation, then a quick calculation shows that $\theta^* = (c^*, \sigma^*)$ maximizes the expected log-likelihood if and only if $c^* = c^\dagger$ and $\sigma^* = \sigma^\dagger$. Hence, the asymptotic Bayesian posterior estimate of the tail probability $\mathbb{P}_{X \sim \mu^\dagger}\big[|X - c^\dagger| \geq t\sigma^\dagger\big]$ is the Gaussian value
$$1 - \operatorname{erf}\big(t/\sqrt{2}\big),$$
whereas the extreme cases that prove the sharpness of Chebyshev's inequality (in which the probability measure is a discrete measure with support on at most three points in $\mathbb{R}$) have, for $t \geq 1$,
$$\mathbb{P}_{X \sim \mu^\dagger}\big[|X - c^\dagger| \geq t\sigma^\dagger\big] = \frac{1}{t^2}.$$
In the case of the archetypically rare "6σ event," i.e., t = 6, the ratio between the two is approximately $1.4 \times 10^7$. This comparison is, of course, almost perversely extreme: it would be obvious to any observer with only moderate amounts of "Chebyshev-type" sample data that the data had been drawn from a highly non-Gaussian distribution. However, it is not inconceivable that the true distribution $\mu^\dagger$ has a Gaussian-looking bulk but also has tails that are significantly fatter than those of a Gaussian, and the difference may be difficult to establish using reasonable amounts of sample data; yet it is those tails that drive the occurrence of "Black Swans," catastrophically high-impact but low-probability outcomes.
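The ratio quoted for the "6σ event" can be reproduced numerically. The sketch below assumes that the quantity being compared is the two-sided tail probability P[|X - c| ≥ tσ], which is the reading under which the stated ratio of about 1.4 × 10^7 comes out; the three-point measure used here is the standard extremal case for Chebyshev's inequality, and all variable names are illustrative.

```python
import math

t = 6.0

# Two-sided Gaussian tail: P[|X - c| >= t * sigma] = erfc(t / sqrt(2)).
gaussian_tail = math.erfc(t / math.sqrt(2.0))

# Three-point measure attaining Chebyshev's bound (taking c = 0, sigma = 1):
# mass 1/(2 t^2) at -t and at +t, and the remaining mass at 0.
points = [-t, 0.0, t]
weights = [1.0 / (2.0 * t**2), 1.0 - 1.0 / t**2, 1.0 / (2.0 * t**2)]
mean = sum(w * x for w, x in zip(weights, points))
variance = sum(w * (x - mean) ** 2 for w, x in zip(weights, points))
chebyshev_tail = sum(w for w, x in zip(weights, points) if abs(x - mean) >= t)

ratio = chebyshev_tail / gaussian_tail
print(f"Gaussian tail:  {gaussian_tail:.3e}")
print(f"Chebyshev tail: {chebyshev_tail:.3e} (variance {variance:.3f})")
print(f"ratio:          {ratio:.3e}")
```

Both measures share the same mean and variance, so a Gaussian model fitted by maximum expected log-likelihood cannot distinguish them through its two parameters.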
Although it is understood that Bayesian estimators can be inconsistent if the model is grossly misspecified, a pressing question is whether they have good convergence properties when the model class {μ(θ) | θ ∈ Θ} is "close enough" to the truth μ † in an appropriate sense.
Such concerns can be traced back to Box's dictum that "essentially, all models are wrong, but some are useful" [12, p. 424] and question "how wrong do they have to be to not be useful?" [12, p. 74]. These queries are also critical because, although gross misspecification of the model can be detected before engaging in a complete Bayesian analysis [32,61], usually one cannot be sure that the model is well-specified.
To answer these questions we will examine the robustness of Bayesian inference by computing optimal bounds on prior and posterior values in terms of given sets of priors. Indeed, the exploration of classes of Bayesian models is one response to the concern that the choice of prior-likelihood combination could, to some degree, be arbitrary, and this forms the basis of the approach known as robust Bayesian inference [4,6,11,58,60]. To do so, we need some definitions.
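Definition 1. For a model class $\mathcal{A} \subseteq \mathcal{M}(\mathcal{X})$, a quantity of interest Φ, and a set of priors $\Pi \subseteq \mathcal{M}(\mathcal{A})$, let
$$L(\Pi) := \inf_{\pi \in \Pi} \mathbb{E}_{\mu \sim \pi}[\Phi(\mu)], \qquad U(\Pi) := \sup_{\pi \in \Pi} \mathbb{E}_{\mu \sim \pi}[\Phi(\mu)]$$
denote the optimal lower and upper bounds on the prior values of Φ. For B a nonempty open subset of the data space $\mathcal{X}^n$, let $\Pi_B \subseteq \Pi$ be the subset of priors π such that the probability that d ∈ B is nonzero, i.e., $\mathbb{P}_{\mu \sim \pi,\, d \sim \mu^n}[d \in B] > 0$, and let
$$L(\Pi \mid B) := \inf_{\pi \in \Pi_B} \mathbb{E}_{\mu \sim \pi,\, d \sim \mu^n}[\Phi(\mu) \mid d \in B], \qquad U(\Pi \mid B) := \sup_{\pi \in \Pi_B} \mathbb{E}_{\mu \sim \pi,\, d \sim \mu^n}[\Phi(\mu) \mid d \in B]$$
denote the optimal lower and upper bounds on the posterior values of Φ given d ∈ B.
Now denote the model class of the Bayesian model P by $\mathcal{A}_0$ and the prior by $\pi_0 \in \mathcal{M}(\mathcal{A}_0)$. To quantify perturbations in the model and define what it means for two distributions to be close to one another, we select a metric ρ on $\mathcal{M}(\mathcal{X})$. As illustrated in Figure 1, for α > 0, we enlarge the set $\mathcal{A}_0$ to its metric neighborhood $\mathcal{A}_\alpha$ and thereby naturally determine a set of priors $\Pi_\alpha \subseteq \mathcal{M}(\mathcal{A}_\alpha)$ such that the random measure $\mu_\alpha$ associated with every $\pi_\alpha \in \Pi_\alpha$ lies within distance α of the random measure $\mu_0$ associated with the prior $p_0$ and the Bayesian model P.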
Then we analyze the robustness of its posteriors, as in Definition 1, with respect to these size-α perturbations.
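To that end, suppose that $\mathcal{X}$ is metrizable and select a consistent metric d for $\mathcal{X}$. Let $\mathcal{B}(\mathcal{X})$ denote the Borel subsets of $\mathcal{X}$. We will consider two metric distances ρ(μ, ν) between $\mu, \nu \in \mathcal{M}(\mathcal{X})$: ρ will be either the TV metric
$$\rho_{\mathrm{TV}}(\mu, \nu) := \sup_{E \in \mathcal{B}(\mathcal{X})} |\mu(E) - \nu(E)|$$
or the Prokhorov metric
$$\rho_{\mathrm{P}}(\mu, \nu) := \inf\big\{ \varepsilon > 0 \;\big|\; \mu(E) \leq \nu(E^{\varepsilon}) + \varepsilon \text{ for all } E \in \mathcal{B}(\mathcal{X}) \big\}, \qquad E^{\varepsilon} := \{ x \in \mathcal{X} \mid d(x, E) < \varepsilon \}.$$
For α > 0, the neighborhood $\mathcal{A}_\alpha$ of $\mathcal{A}_0$ emerges naturally from the ball fibration in the sense that if $P_0$ and $P_\alpha$ denote the projections onto the first and second components [...]
To state our result, consider again Problem 2 and let some $x^n := (x_1, \dots, x_n) \in \mathcal{X}^n$ and δ > 0 be given, and set $B^n_\delta := \prod_{i=1}^n B_\delta(x_i)$, where $B_\delta(x)$ is the open ball of radius δ centered on $x \in \mathcal{X}$. Using the notation of Definition 1, and $\Pi_\alpha$ defined above in terms of the TV or Prokhorov metric, the Brittleness Theorem 6.4 of [47] then reads as follows (see footnote 4).
Theorem 1. If the model P satisfies
$$\lim_{\delta \downarrow 0} \sup_{x \in \mathcal{X}} \sup_{\theta \in \Theta} \mu(\theta)[B_\delta(x)] = 0, \tag{2}$$
then, for all α > 0, there exists δ(α) > 0 such that, for all 0 < δ < δ(α), all $n \in \mathbb{N}$, and all $x^n \in \mathcal{X}^n$,
$$L(\Pi_\alpha \mid B^n_\delta) \leq \operatorname{ess\,inf}_{\pi_0}(\Phi) \quad \text{and} \quad \operatorname{ess\,sup}_{\pi_0}(\Phi) \leq U(\Pi_\alpha \mid B^n_\delta).$$
Note that condition (2) is extremely weak and is satisfied for most parametric Bayesian models. Furthermore, suppose that Cromwell's rule is applied. Then, although it implies consistency if the model is well-specified, here it leads to maximal brittleness under local misspecification. More precisely, under Cromwell's rule, $\operatorname{ess\,inf}_{\pi_0}(\Phi) = \inf_{\mu \in \mathcal{A}_0} \Phi(\mu)$ and $\operatorname{ess\,sup}_{\pi_0}(\Phi) = \sup_{\mu \in \mathcal{A}_0} \Phi(\mu)$, so the conclusion of Theorem 1 becomes
$$L(\Pi_\alpha \mid B^n_\delta) \leq \inf_{\mu \in \mathcal{A}_0} \Phi(\mu) \quad \text{and} \quad \sup_{\mu \in \mathcal{A}_0} \Phi(\mu) \leq U(\Pi_\alpha \mid B^n_\delta).$$
In other words, the range of posterior predictions among all admissible priors is as wide as the deterministic range of the quantity of interest Φ. Note that since Φ is arbitrary, the brittleness described in Theorem 1 is not limited to a quantile or moment of μ but concerns its whole posterior distribution.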
Brittleness under Finite Information. One response to the concern that the choices of prior and model are somewhat arbitrary [58] is to perform a sensitivity analysis over classes of priors and models. One way to specify a class Π of admissible priors π is to select some "features" (such as the polynomial moments, or other functionals) and specify some values, ranges, or distributions for those features. It is interesting to understand the impact of those features left unspecified, i.e., the codimension and not just the dimension of Π; while robust Bayesian inference [4,6,11,60] has shown that posterior conclusions remain stable when Π is finite-dimensional, our results can be interpreted as saying that brittleness ensues whenever Π has finite codimension, regardless of how large its codimension is. It is important to note that this is in some sense the generic situation: when A is an infinite set, one would have to specify infinitely many features of priors π ∈ Π to achieve a finite-dimensional Π; from a computational and epistemic standpoint, the specification of infinitely many features in finite time appears to be somewhat problematic.
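To study this problem, we introduce a representation space $\mathcal{Q}$ (e.g., prototypically, $\mathbb{R}^k$) and a mapping $\Psi \colon \mathcal{A} \to \mathcal{Q}$ from the subset $\mathcal{A} \subseteq \mathcal{M}(\mathcal{X})$ into $\mathcal{Q}$, which can be thought of as a map to "generalized moments." Let $\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{Q})$ be a subset of the set of probability distributions on $\mathcal{Q}$ such that each distribution $Q \in \mathfrak{Q}$ has its support contained in $\Psi(\mathcal{A})$. If the set $\mathfrak{Q}$ represents priors for the distribution of Ψ(μ), $\mu \in \mathcal{A}$, then a naturally induced set of priors Π on $\mathcal{A}$ is the pull-back
$$\Pi := \Psi^{-1}\mathfrak{Q} := \{ \pi \in \mathcal{M}(\mathcal{A}) \mid \Psi\pi \in \mathfrak{Q} \},$$
where Ψπ denotes the push-forward of π by Ψ.
Footnote 4. All results of this article and those in [46,47,48] require some mild technical measure-theoretic and topological assumptions. For example, here it is sufficient if P(Θ) is a Borel subset of a Polish space (a separable completely metrizable space). Unfortunately, $\mathcal{M}(\mathcal{X})$ is not generally separable with respect to the TV metric, and hence is not Polish. However, if $\mathcal{X}$ is Polish, then $\mathcal{M}(\mathcal{X})$ topologized by weak convergence is Polish and the Prokhorov metric provides a complete metrization of it. Consequently, when Θ is Polish, $\mathcal{X}$ is Polish, and P is injective and measurable with respect to the weak topology, it then follows from Suslin's Theorem that P(Θ) is a Borel subset of the Polish space $\mathcal{M}(\mathcal{X})$. For a thorough investigation of such matters, illustrating the benefits of Polish spaces as the foundation for the framework, see [47].
Example 3. [...] Repeat this conditioning process on the higher coordinates iteratively in the same manner. Then, the induced set of priors [...]
We now state the Brittleness Theorem 4.13 in [47] for the general case of Problem 2 and apply it to Example 3. To that end, let the model class $\mathcal{A} \subseteq \mathcal{M}(\mathcal{X})$ be chosen along with a generalized moment map $\Psi \colon \mathcal{A} \to \mathcal{Q}$ to a representation space $\mathcal{Q}$. Let $\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{Q})$ be a specified set of priors on $\mathcal{Q}$ and from them determine the set of priors $\Pi := \Psi^{-1}\mathfrak{Q}$ on $\mathcal{A}$. For $(x_1, \dots, x_n) \in \mathcal{X}^n$ and δ > 0, let $B^n_\delta := \prod_{i=1}^n B_\delta(x_i)$, where $B_\delta(x)$ is the open ball of radius δ centered on $x \in \mathcal{X}$. The following theorem gives optimal bounds on posterior values for the class of priors Π defined above, given that the observation $d \in B^n_\delta$.
Theorem 2. Suppose that, for all γ > 0, there exists some $Q \in \mathfrak{Q}$ such that
$$\mathbb{E}_{q \sim Q}\Big[ \inf_{\mu \in \Psi^{-1}(q)} \mu^n[B^n_\delta] \Big] = 0 \tag{3}$$
and
$$\mathbb{P}_{q \sim Q}\Big[ \sup_{\mu \in \Psi^{-1}(q),\; \mu^n[B^n_\delta] > 0} \Phi(\mu) > \sup_{\mu \in \mathcal{A}} \Phi(\mu) - \gamma \Big] > 0. \tag{4}$$
Then
$$U(\Pi \mid B^n_\delta) = \sup_{\mu \in \mathcal{A}} \Phi(\mu),$$
with similar expressions for the lower bounds L.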
In other words, if there is a measure $Q \in \mathfrak{Q}$ such that for Q-almost all $q \in \mathcal{Q}$, there is a $\mu \in \Psi^{-1}(q)$ which achieves an arbitrarily small mass on one of $B_\delta(x_i)$, i = 1, . . . , n, and with nonzero Q probability there is $\mu \in \Psi^{-1}(q)$ which almost extremizes Φ while putting positive mass on all $B_\delta(x_i)$, i = 1, . . . , n, then the range $\big[L(\Pi \mid B^n_\delta),\, U(\Pi \mid B^n_\delta)\big]$ of posterior values for Φ is exactly the "deterministic" range of Φ, i.e., $\big[\inf_{\mu \in \mathcal{A}} \Phi(\mu),\, \sup_{\mu \in \mathcal{A}} \Phi(\mu)\big]$.
Conditions (3) and (4) are very weak, and simple dimensionality arguments suggest that they are typically satisfied if Q is finite-dimensional. Hence, although Bayesian inference is robust in situations where the distributions of all but finitely many generalized moments of the data-generating distribution μ † are known, Theorem 2 suggests that it is brittle when the distributions of only finitely many generalized moments of μ † are known, while infinitely many remain unknown. As an example, it is instructive to observe how Theorem 2, applied to Example 3 in [47,Ex. 4.16], shows that if the data-generating measure has some nonatomic component, then when the number of samples n is large enough and δ small enough, the optimal bounds on posterior values of Φ(μ) = E μ [X], given the distribution Q defined on its moments, are 0 and 1.
To quantify "large enough" and "small enough" and to remove the "nonatomic" requirement above, Theorem 3.1 of [46] provides a quantitative version of Theorem 2 in which the conditions of the theorem are only required to hold approximately. When applied to Example 3 with the set $\Pi := \Psi^{-1}Q$ of priors generated instead by the uniform prior Q restricted to the truncated moment space, Theorem 3.3 of [46] establishes that, although the prior value satisfies $U(\Pi) = 1/2$, the posterior value satisfies [...]. Consequently, regardless of the number of moment constraints k and the location of a single data point, for δ smaller than an elementary known function of k, we have brittleness. This result also holds for arbitrary multiple samples. Remark 4.18 of [47] also suggests that brittleness would persist if the hard bound δ to specify measurement uncertainty were replaced by a level of noise with variance decreasing with δ.

Mechanism Causing Brittleness.
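We will now illustrate one mechanism causing brittleness with a simple example derived from the proof of Theorem 1. In this example we are interested in estimating $\Phi(\mu^\dagger) = \mathbb{E}_{\mu^\dagger}[X]$, where $\mu^\dagger$ is an unknown distribution on the unit interval ($\mathcal{X} = [0, 1]$), based on the observation of a single data point $d_1 = 1/2$ up to resolution δ (i.e., we observe $d_1 \in B_\delta(x_1)$ with $x_1 = 1/2$). Consider the following two models $\mu_a(\theta)$ and $\mu_b(\theta)$ on the unit interval [0, 1], parameterized by θ ∈ (0, 1) and with densities $f_a$ and $f_b$ given by
$$f_a(x, \theta) := (1 - \theta)\Big(1 + \frac{1}{\theta}\Big)(1 - x)^{1/\theta} + \theta\Big(1 + \frac{1}{1 - \theta}\Big)x^{1/(1 - \theta)},$$
$$f_b(x, \theta) := \begin{cases} \dfrac{1}{Z}\, f_a(x, \theta)\, \mathbf{1}\big[x \notin (1/2 - \delta_c,\, 1/2 + \delta_c)\big], & \theta < 0.999, \\ f_a(x, \theta), & \theta \geq 0.999, \end{cases}$$
where Z is a normalization constant (close to one) chosen so that $\int_{[0,1]} f_b(x, \theta)\, dx = 1$. See Figure 2 for an illustration of these densities.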
Observe that the density of model b is that of model a apart from the small gap of width $\delta_c > 0$ created around the data point for model b (if θ < 0.999, see Figure 2); since the data point is fixed at $x_1 = 1/2$, the TV distance $\rho_{\mathrm{TV}}\big(\mu_a(\theta), \mu_b(\theta)\big)$ between the two models is, uniformly over θ ∈ (0, 1), bounded by a constant times $\delta_c$. Assuming that the prior distribution on θ is the uniform distribution on (0, 1), observe that the prior value of the quantity of interest $\mathbb{E}_\mu[X]$ under both models (a and b) is approximately $1/2$. Now, when θ is close to 1 (zero), the density of model a puts most of its mass toward 1 (zero). Observe also that the density of model b behaves in a similar way, with the important exception that the probability of observing the data under model b is infinitesimally small for θ < 0.999. Therefore, for $\delta < \delta_c$, the posterior value of the quantity of interest $\mathbb{E}_\mu[X]$ under model a is $1/2$, whereas it is close to 1 under model b. Observe also that a perturbed model c analogous to b can be constructed to lead to a posterior value close to zero. The mechanism described here is generic and $\mu_b(\theta)$ is a simple example of what worst priors can look like after a classical Bayesian sensitivity analysis over a class of priors specified via constraints on the TV or Prokhorov distance or the distribution of a finite number of moments.
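The mechanism can be reproduced numerically. The sketch below is only an illustration under stated assumptions: it takes model a to have density f_a(x, θ) = (1 - θ)(1 + 1/θ)(1 - x)^{1/θ} + θ(1 + 1/(1 - θ))x^{1/(1 - θ)}, a form that matches the qualitative description (mass moves toward 1 as θ → 1 and toward 0 as θ → 0), and it treats model b as model a with the data ball removed for θ < 0.999. The function names, the grid size, and the midpoint quadrature over the uniform prior on θ are hypothetical choices.

```python
def cdf_a(x, theta):
    """CDF of the assumed model-a density on [0, 1]."""
    k = 1.0 / theta
    j = 1.0 / (1.0 - theta)
    return (1.0 - theta) * (1.0 - (1.0 - x) ** (k + 1.0)) + theta * x ** (j + 1.0)

def mean_a(theta):
    """E_mu[X] under the assumed model-a density."""
    k = 1.0 / theta
    j = 1.0 / (1.0 - theta)
    return (1.0 - theta) / (k + 2.0) + theta * (j + 1.0) / (j + 2.0)

def posterior_mean(delta, perturbed, n_grid=20000, theta_cut=0.999):
    """Posterior value of E_mu[X] given d_1 in B_delta(1/2), uniform prior on theta.

    With perturbed=True (model b, assuming delta < delta_c), the data ball lies
    inside the gap for theta < theta_cut, so those theta have zero likelihood;
    for theta >= theta_cut model b coincides with model a.
    """
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        theta = (i + 0.5) / n_grid  # midpoint rule on (0, 1)
        if perturbed and theta < theta_cut:
            continue
        like = cdf_a(0.5 + delta, theta) - cdf_a(0.5 - delta, theta)
        num += mean_a(theta) * like
        den += like
    return num / den

post_a = posterior_mean(1e-3, perturbed=False)
post_b = posterior_mean(1e-3, perturbed=True)
print(f"posterior E[X] under model a: {post_a:.4f}")
print(f"posterior E[X] under model b: {post_b:.4f}")
```

The two models differ only on a set of small probability, yet conditioning on the data selects exactly that set, which is why the posterior values separate while the prior values agree.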
Can these worst priors be dismissed because they depend on the data? The problem with this argument is that, in the context of Bayesian sensitivity analysis, worst priors always depend on (or are preadapted to) the data. Therefore, the same argument would lead to a dismissal of Bayesian sensitivity analysis and therefore the framework of robust Bayesian inference. In some sense, the brittleness results reported here can be seen as extreme occurrences of the dilation property [59] which, in robust Bayesian inference, refers to the enlargement of optimal bounds caused by the data dependence of worst priors. Indeed, even if perturbations are quantified in KL divergence, the local sensitivity analysis (in the sense of Fréchet derivatives) of posterior values [27] shows infinite sensitivity as the number of data points goes to infinity (and this result is valid for the broader class of divergences that includes the Hellinger distance).
Can these worst priors be dismissed because they can "look unrealistic" and make the probability of observing the data very small? The problem with this argument is that these worst priors are not "isolated pathologies" but directions of instability (of Bayesian conditioning) increasing with the number of data points and the complexity of the system under investigation. We will illustrate this point with another simple example that also shows that these instabilities are the price to pay for the learning potential of Bayesian inference. In this example, we are interested in estimating $\Phi(\mu^\dagger) = \mu^\dagger[X \geq a]$ for some a ∈ (0, 1), where $\mu^\dagger$ is an unknown distribution on the unit interval [0, 1], based on the observation of n data points up to resolution δ. Consider the model class $\mathcal{A} := \mathcal{M}([0, 1])$ and the class of priors
$$\Pi := \big\{ \pi \in \mathcal{M}(\mathcal{A}) \;\big|\; \mathbb{E}_{\mu \sim \pi}\big[\mathbb{E}_\mu[X]\big] = m \big\}.$$
Observe that Π corresponds to the assumption that $\mu^\dagger$ is the realization of a random measure on [0, 1] whose mean is on average m. As in the previous example, the finite codimensional class of priors Π leads to brittleness in the sense that the least upper bound on prior values is $U(\Pi) = m/a$, whereas (for $\delta \ll 1/n$) the least upper bound on posterior values is the deterministic supremum of the quantity of interest (over $\mathcal{A}$), i.e., $U(\Pi \mid B^n_\delta) = 1$. Furthermore, worst priors are obtained by selecting priors for which the probability of observing the data $\mu^n[B^n_\delta]$ is arbitrarily close to zero except when Φ(μ) is close to its deterministic supremum.

Learning and Robustness
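Can this brittleness be avoided by adding a uniform constraint on the probability of observing the data in the model class? To investigate this question, let us introduce α ≥ 1 and a probability measure $\mu_0$ on [0, 1] with strictly positive Lebesgue density (with $\mu_0$ being the uniform measure on [0, 1] as a prototypical example) and consider the (new) model class
$$\mathcal{A}(\alpha) := \Big\{ \mu \in \mathcal{M}([0, 1]) \;\Big|\; \frac{1}{\alpha}\, \mu_0^n[B^n_\delta] \leq \mu^n[B^n_\delta] \leq \alpha\, \mu_0^n[B^n_\delta] \Big\} \tag{6}$$
and the (new) class of priors
$$\Pi(\alpha) := \big\{ \pi \in \mathcal{M}(\mathcal{A}(\alpha)) \;\big|\; \mathbb{E}_{\mu \sim \pi}\big[\mathbb{E}_\mu[X]\big] = m \big\},$$
where, in (6), $B^n_\delta := \prod_{i=1}^n B_\delta(x_i)$ and $(x_1, \dots, x_n) \in [0, 1]^n$ is fixed. Note that, for the model class $\mathcal{A}(\alpha)$, the probability of observing the data is uniformly bounded from below by $\frac{1}{\alpha}\mu_0^n[B^n_\delta]$ and from above by $\alpha\mu_0^n[B^n_\delta]$. Therefore, for α = 1, the probability of observing the data is uniform in the model class, prior values are equal to posterior values, and the method is robust but learning is impossible. On the other hand, if α slightly deviates from 1, then the calculus developed in [47] (Theorems 4.8 and 4.13) gives
$$\lim_{\delta \to 0} U\big(\Pi(\alpha) \mid B^n_\delta\big) = \frac{1}{1 + \dfrac{1}{\alpha^2}\, \dfrac{a - m}{m}}. \tag{7}$$
Note that the right-hand side of (7) is equal to m/a for α = 1 (when the probability of the data is constant on the model class) and quickly converges toward 1 as α increases. As a numerical application observe that for $a = 3/4$ and $m = a/2 = 3/8$, we have
$$\lim_{\delta \to 0} U\big(\Pi(\alpha)\big) = \frac{1}{2} \quad \text{and} \quad \lim_{\delta \to 0} U\big(\Pi(\alpha) \mid B^n_\delta\big) = \frac{1}{1 + 1/\alpha^2}.$$
Therefore, for α = 2, we have (irrespective of the number of data points) $\lim_{\delta \to 0} U\big(\Pi(\alpha) \mid B^n_\delta\big) = 4/5$. Moreover, if α is derived by assuming the probability of each data point to be known up to some tolerance γ, i.e., if the model class $\mathcal{A}(\alpha)$ is replaced by the set of $\mu \in \mathcal{M}([0, 1])$ such that $\frac{1}{1 + \gamma}\mu_0[B_\delta(x_i)] \leq \mu[B_\delta(x_i)] \leq (1 + \gamma)\mu_0[B_\delta(x_i)]$ for i = 1, . . . , n, then $\alpha = (1 + \gamma)^n$ and
$$\lim_{\delta \to 0} U\big(\Pi(\alpha) \mid B^n_\delta\big) = \frac{1}{1 + (1 + \gamma)^{-2n}},$$
which exponentially converges toward 1 as the number n of data points goes to infinity.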
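The dependence on α can be explored with a small computation. The sketch below does not implement the reduction calculus of [47]; it assumes a hypothetical two-point prior, which in the δ → 0 limit puts weight w = m/a on measures with Φ = 1 and mean a whose data likelihood sits at the upper limit α μ_0^n[B], and weight 1 - w on measures with Φ = 0 and mean 0 whose data likelihood sits at the lower limit μ_0^n[B]/α. The function name and the choice γ = 0.1 are illustrative.

```python
def two_point_posterior_value(a, m, alpha):
    """Posterior value of Phi = mu[X >= a] under the assumed two-point prior.

    Likelihoods are expressed relative to mu_0^n[B], which cancels in the ratio.
    """
    w = m / a               # prior weight needed to meet the mean constraint m
    like_hi = alpha         # relative data likelihood where Phi = 1
    like_lo = 1.0 / alpha   # relative data likelihood where Phi = 0
    return w * like_hi / (w * like_hi + (1.0 - w) * like_lo)

a, m = 0.75, 0.375
for alpha in (1.0, 1.5, 2.0, 5.0, 10.0):
    value = two_point_posterior_value(a, m, alpha)
    print(f"alpha = {alpha:5.1f}: posterior value = {value:.4f}")

# Per-sample tolerance gamma, assuming the combined factor alpha = (1 + gamma)**n.
gamma = 0.1
for n in (1, 10, 50):
    value = two_point_posterior_value(a, m, (1.0 + gamma) ** n)
    print(f"n = {n:3d}: posterior value = {value:.6f}")
```

At α = 1 the posterior value equals the prior value m/a, and it moves toward 1 as α grows, which is the loss of robustness that accompanies any ability to learn from the data.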
In conclusion, the effects of a uniform constraint on the probability of the data under finite information in the model class show that learning ability comes at the price of a loss in stability in the following sense: when α = 1, the data is equiprobable under all measures in the model class, posterior values are equal to prior values, and the method is robust but learning is not possible. As α deviates from 1, the learning ability increases as robustness decreases, and when α is large, learning is possible but the method is brittle.
Qualitative Robustness and Consistency. Since the data dependence of worst priors is inherent to classical Bayesian sensitivity analysis, one might ask whether robustness could be established under finite information by leaving the strict framework of robust Bayesian inference and computing the sensitivity of posterior conclusions independently of the specific value of the data. Indeed, in the current classical Bayesian sensitivity analysis framework, given a class of priors Π and the observation $d \in B^n_\delta(x)$, we compute the optimal bounds $L\big(\Pi \mid B^n_\delta(x)\big)$ and $U\big(\Pi \mid B^n_\delta(x)\big)$, whose gap corresponds to the sensitivity of posterior values (given the value of the data) with respect to the particular choice of prior π ∈ Π. Therefore, the interpretation of the brittleness mechanisms discussed above should be limited to the significance of such optimal bounds, which are not the sole measure of robustness of a Bayesian estimation. An alternative analysis could be to quantify the sensitivity of the distribution of posterior values. For instance, given a class of priors $\Pi \subset \mathcal{M}(\mathcal{A})$ over a model class $\mathcal{A} \subseteq \mathcal{M}(\mathcal{X})$, the value of
$$\sup_{\pi, \pi' \in \Pi,\; \nu \in \mathcal{A}} \mathbb{P}_{d \sim \nu^n}\Big[ \big| \mathbb{E}_{\mu \sim \pi}[\Phi(\mu) \mid d] - \mathbb{E}_{\mu \sim \pi'}[\Phi(\mu) \mid d] \big| \geq \varepsilon \Big]$$
is the least upper bound on the probability that posterior values derived from π, π′ ∈ Π and randomized through an admissible candidate $\nu \in \mathcal{A}$ for the distribution of the data deviate by at least ε > 0. This form of analysis is directly related to Hampel's [30] and Cuevas' [16] notion of qualitative robustness, which requires closeness in distributions of the posterior distribution rather than in posterior distributions. More precisely, given a metric $\rho_2$ on $\mathcal{M}(\mathcal{M}(\mathcal{A}))$, a qualitative sensitivity analysis would seek to bound $\rho_2(\pi * \nu^n, \pi' * \nu^n)$ (over π, π′ ∈ Π and $\nu \in \mathcal{A}$), where $\pi * \nu^n \in \mathcal{M}(\mathcal{M}(\mathcal{A}))$ is the distribution of the posterior distribution of the prior $\pi \in \mathcal{M}(\mathcal{A})$ when the data $d = (d_1, \dots, d_n)$ is randomized through $\nu^n$.
If, unlike Hampel and Cuevas who require "closeness for all n," we follow Huber [33] and Mizera [44] in only requiring closeness "for large enough n" (i.e., in the limit as the number of data points tends to infinity), then we obtain [45] a notion of qualitative robustness, where the notion of consistency (i.e., the property that posterior distributions converge toward the data-generating distribution) plays an important role. Although consistency is primarily a frequentist notion, according to Blackwell and Dubins [10] and Diaconis and Freedman [17], consistency is equivalent to intersubjective agreement, which means that two Bayesians will ultimately have very close predictive distributions. Fortunately, not only are there mild conditions which guarantee consistency, but the posterior distributions can be shown to contract/concentrate at an exponential rate around the data-generating distribution (see [55] for rates of contraction of posterior distributions based on Gaussian process priors) and the Bernstein-von Mises theorem goes further in providing mild conditions under which the posterior is asymptotically normal [13,14]. The most famous of these are Doob [19], Le Cam and Schwartz [39], and Schwartz [50,Thm. 6.1]. Unfortunately, the conditions ensuring consistency (e.g., the condition that the prior has KL support at the parameter value generating the data) are such that arbitrarily small (TV or Prokhorov) local perturbations of the prior distribution (near the data-generating distribution) may result in consistency or non-consistency, and therefore may have large impacts on the asymptotic behavior of posterior distributions [45]. A simple illustration of this mechanism is as follows [45]. Suppose that the data-generating distribution ν is at distance τ > 0 from the support of the prior π. Let $\pi_1$ be a prior distribution with all of its mass on or around ν (having KL support at ν). Take $\pi_\varepsilon := (1 - \varepsilon)\pi + \varepsilon\pi_1$.
The TV distance from π to $\pi_\varepsilon$ is bounded by ε, which can be chosen to be arbitrarily small. Furthermore, $\pi_\varepsilon$ inherits the KL support of $\pi_1$ at ν and by Schwartz's consistency theorem [50] its posterior distribution converges (almost surely) toward a Dirac concentrated at ν as n → ∞. On the other hand, the distance between the support of the posterior distribution of π and ν remains bounded from below by τ. This simple example exposes a serious challenge to proving robustness in the TV metric or any weaker metric, such as those used in the convergence of MCMC.
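This perturbation argument can be made concrete in a toy setting. The sketch below is an assumption-laden illustration, not the construction of [45]: the measures are Bernoulli distributions identified with their success probability, the prior π is supported on {0.2, 0.3}, the data-generating parameter is 0.7, and the perturbation places mass ε on a point mass at the truth. The data are replaced by their expected counts so that the output is deterministic; all names are hypothetical.

```python
import math

def posterior(prior, heads, flips):
    """Posterior over Bernoulli parameters; prior maps parameter -> mass."""
    logw = {p: math.log(w) + heads * math.log(p) + (flips - heads) * math.log(1.0 - p)
            for p, w in prior.items() if w > 0.0}
    top = max(logw.values())  # subtract the maximum for numerical stability
    unnorm = {p: math.exp(v - top) for p, v in logw.items()}
    z = sum(unnorm.values())
    return {p: v / z for p, v in unnorm.items()}

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

truth = 0.7
prior = {0.2: 0.5, 0.3: 0.5}                 # support at distance 0.4 from truth
eps = 1e-6
prior_eps = {p: (1.0 - eps) * w for p, w in prior.items()}
prior_eps[truth] = eps                       # (1 - eps) * prior + eps * delta_truth

n = 1000
heads = round(truth * n)                     # expected number of successes
post = posterior(prior, heads, n)
post_eps = posterior(prior_eps, heads, n)

print("TV(prior, prior_eps)       =", tv_distance(prior, prior_eps))
print("posterior mass at truth    =", post_eps[truth])
print("gap from truth, unperturbed =", min(abs(p - truth) for p in post))
```

A perturbation of size 10^-6 in TV is enough to move essentially all posterior mass onto the truth, while the unperturbed posterior stays at distance 0.4 from it for every n.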
Of course, in a parametric setting, if the parameter space Θ is compact and the model well-specified (the data generated from a parameter in that space), then choosing a prior satisfying Cromwell's rule (putting mass in the KL neighborhood of all parameters) ensures qualitative robustness (and the degree of robustness is a function of how much mass is placed in each neighborhood). However, if Θ is compact and the model is misspecified, then, even if the prior is nice and smooth, the mechanism discussed above suggests that it is not qualitatively robust (with a degree of nonrobustness corresponding to the degree of misspecification; the prior does not need to look "unrealistic" to be nonqualitatively-robust). Note also that if Θ is noncompact, then the prior cannot be qualitatively robust (because no matter how small ε is, one can always find a neighborhood of the parameter space with mass smaller than ε).
In a nonparametric setting, consistent priors (such as the ones analyzed in [55] with bounds on convergence rates) remain good/natural choices when their posterior distributions can be computed. However, consistency and robustness are to some degree conflicting requirements [16,45] from the point of view of a numerical analyst. Consider, for instance, the problem of using a sophisticated numerical Bayesian model to predict the climate, where Bayes' rule is applied iteratively and posterior values become prior values for the next iteration. How do we make sure that our predictions are robust, not only with respect to the choice of prior but also with respect to numerical instabilities arising in the iterative application of Bayes' rule? The nonrobustness mechanisms discussed here suggest that, unless the prior is chosen carefully, and unless we have a tight control on numerical instabilities, errors, and approximations at each step of the iteration, our final predictions might be unstable. Note that, often, these posterior distributions (which are later used as prior distributions) are only approximated (e.g., via MCMC methods), and so how do we go about ensuring the stability of our method in such situations? The brittleness results discussed here suggest that having strong convergence of our MCMC method in TV would not be enough to ensure stability. Note in particular that although quantifying perturbations in KL ensures qualitative robustness, it would also require controlling the convergence of the MCMC method in KL or in a stronger metric.
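A toy computation illustrates how a TV-small approximation error at each step can destabilize iterated Bayesian updating. The sketch below rests on assumptions that are not in the original text: a Bernoulli model on a 9-point parameter grid, a data-generating parameter of 0.5, a hypothetical atypical first batch (17 heads in 20 flips) followed by 50 typical batches (10 heads in 20 flips), and a truncation step (the hypothetical helper `truncate`) standing in for the error of an approximate sampler. The truncation removes at most `tol` of mass, so each step is within `tol` of the exact update in TV, but it can remove support, which corresponds to an infinite KL divergence of the exact posterior from its approximation.

```python
import numpy as np

def bayes_update(thetas, w, heads, n):
    """One Bernoulli-likelihood update of the weights w on the grid thetas."""
    with np.errstate(divide="ignore"):  # log(0) = -inf for removed atoms
        log_w = np.log(w) + heads * np.log(thetas) + (n - heads) * np.log1p(-thetas)
    w_new = np.exp(log_w - log_w.max())
    return w_new / w_new.sum()

def truncate(w, tol):
    """Zero out the smallest atoms whose total mass is at most tol, then renormalize.

    The TV distance between w and the result equals the dropped mass (<= tol).
    """
    order = np.argsort(w)
    dropped = order[np.cumsum(w[order]) <= tol]
    out = w.copy()
    out[dropped] = 0.0
    return out / out.sum()

thetas = np.linspace(0.1, 0.9, 9)          # parameter grid; thetas[4] == 0.5
tol = 0.01                                 # per-step TV error budget
batches = [(17, 20)] + [(10, 20)] * 50     # hypothetical (heads, flips) batches

exact = np.full(thetas.size, 1.0 / thetas.size)   # uniform prior
approx = exact.copy()
max_tv = 0.0
for heads, n in batches:
    exact = bayes_update(thetas, exact, heads, n)
    stepped = bayes_update(thetas, approx, heads, n)
    approx = truncate(stepped, tol)        # posterior reused as the next prior
    max_tv = max(max_tv, 0.5 * np.abs(stepped - approx).sum())

print("largest per-step TV error:", max_tv)
print("exact posterior mode: ", thetas[np.argmax(exact)])
print("approx posterior mode:", thetas[np.argmax(approx)])
```

In this sketch the exact iteration recovers the data-generating parameter, whereas the truncated iteration discards it after the first batch and can never regain it, even though every approximation step stays within `tol` in TV.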

Conclusion and Perspectives.
It is possible that an analogy can be made between the brittleness and robustness properties of Bayesian inference and the numerical analysis of PDEs, for which many pathologies and also many necessary and/or sufficient stability conditions are known. However, in contrast to conditions such as the well-known CFL condition for PDEs, the question of the existence and nature of a stability condition when using Bayesian inference under finite information remains to be resolved. Although numerical schemes that do not satisfy the CFL condition may look grossly inadequate, the existence of such perverse examples certainly does not imply the dismissal of the necessity of a stability condition. Similarly, although one can, as in the example provided in Figure 2, exhibit grossly perverse worst priors, the existence of such priors does not invalidate the need for a study of stability conditions when using Bayesian inference under finite information. The example provided in (7) suggests that, in the framework of Bayesian sensitivity analysis, such a stability condition would depend on (i) how well the probability of the data is known or constrained in the model class, and (ii) the resolution at which the quantity of interest is conditioned upon the data. Note that the independence of the brittleness threshold δ_c from the number of data points n in Theorem 1 suggests that taking δ fixed and n → ∞ does not prevent brittleness in the classical Bayesian sensitivity analysis framework (it only leads to more directions of instability). On the other hand, for a fixed δ, (5) suggests that brittleness results do not persist in that same framework when the number of moment constraints k (on the class of priors) is large enough.
Furthermore, taking δ > 0 fixed (or discretizing space at a resolution δ > 0) enables the construction of classes of qualitatively robust priors (to TV perturbations) that are nearly consistent as n → ∞ (some degree of consistency is lost due to the discretization). At a higher level, the mechanisms discussed here appear to suggest that robust inference (in a continuous world under finite information) should perhaps be done with reduced/coarse models rather than highly sophisticated/complex models (with a level of "coarseness/reduction" depending on the available "finite information"). In the context of deterministic modeling versus uncertainty quantification, Stuart [53] asked, "should future increased computer resources be invested in further model resolution, or in more detailed study of uncertainty?" The results reported here suggest that the answer is the latter, at least in the context of Bayesian modeling versus robustness studies, because posterior conclusions become nonrobust if model resolution is pushed beyond a threshold defined by model uncertainties.
A close inspection of some of the cases where Bayesian inference has been successful suggests the existence of a non-Bayesian feedback loop on the evaluation of its performance [43,51,42]. Therefore, one natural question is whether the missing stability condition could also be derived by exiting the strictly Bayesian framework, as proposed in [21]. One example of such an approach could be using posterior predictive checking [26], [25, p. 159], whose rationale is to detect model mismatch by generating replicate data from the model, and comparing this replicate data to the original data using statistics related to the quantity of interest.
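As an illustration of the rationale behind posterior predictive checking, the following is a minimal sketch under assumptions that are not taken from the cited references: a Poisson model with a conjugate Gamma(1, 1) prior on its rate, a small hypothetical overdispersed data set `y`, and the sample variance as the test statistic. Replicate data sets are drawn from the posterior predictive distribution and their statistics are compared with the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical count data, deliberately overdispersed relative to a Poisson model.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 12, 15, 0, 0, 9, 0, 0, 11])

# Conjugate update: Gamma(a0, rate b0) prior on the Poisson rate (an assumed prior).
a0, b0 = 1.0, 1.0
a_post, b_post = a0 + y.sum(), b0 + y.size

# Draw rates from the posterior, then one replicate data set per draw.
n_rep = 5000
lam = rng.gamma(shape=a_post, scale=1.0 / b_post, size=n_rep)
y_rep = rng.poisson(lam[:, None], size=(n_rep, y.size))

# Test statistic: sample variance, compared between observed and replicate data.
t_obs = y.var(ddof=1)
t_rep = y_rep.var(axis=1, ddof=1)

# Posterior predictive p-value: fraction of replicates at least as extreme.
p_value = np.mean(t_rep >= t_obs)
print("observed variance:", t_obs)
print("posterior predictive p-value:", p_value)
```

A p-value close to 0 or 1 indicates that the fitted model rarely reproduces the observed statistic, which is the kind of mismatch such a check is meant to expose. The choice of statistic matters: a check based on the sample mean would not flag the mismatch in this toy example.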
It is natural to expect that robustness and stability questions will increase in importance as Bayesian methods become more popular with the availability of computational methodologies and environments to compute the posteriors. Another strong motivation for considering Bayesian methods and investigating such questions is the complete class theorem, which, in the adversarial game theoretic setting of decision theory [57], asserts that optimal statistical estimators (leading to optimal decisions as defined by a convex loss function on a compact parameter space) live in the Bayesian class of estimators [57,22].