# Evaluating Forecasts for High-Impact Events Using Transformed Kernel Scores

## Abstract.

To account for uncertainties, forecasts for future events are commonly expressed in terms of probability distributions over the set of possible outcomes. To evaluate the quality of such forecasts, it is customary to employ proper scoring rules, which provide an objective framework with which to compare forecasters. When evaluating forecasts, it may be useful to emphasize outcomes that have a high impact on the forecast user. Weighted scoring rules have become a well-established tool to achieve this. However, while impacts may result from the interaction of events along multiple dimensions, weighted scores have been studied almost exclusively in the univariate case. This paper introduces weighted multivariate scoring rules, which can emphasize multivariate outcomes of interest when evaluating forecast performance. We demonstrate that the threshold-weighted continuous ranked probability score (twCRPS), arguably the most well-known weighted scoring rule, belongs to the class of kernel scores, and we use this to construct univariate and multivariate generalizations of the twCRPS. These generalizations include weighted versions of popular multivariate scoring rules, such as the energy score and variogram score. This result facilitates the introduction of weighted scoring rules that are appropriate in many practical applications, which also enjoy the theoretical framework associated with kernel scores. To illustrate the additional information that these novel weighted scoring rules provide, results are presented for a case study in which they are used to evaluate daily precipitation accumulation forecasts, with particular emphasis on events that could lead to flooding.


## 1. Introduction.

When predicting future events, it is common to issue forecasts that are probabilistic, thereby comprehensively quantifying the uncertainty of the outcome. Just as important as issuing a forecast is understanding how it is expected to perform. In achieving this, forecasters gain a greater awareness of the strengths and limitations of their predictions and, in turn, learn how they can be improved [21]. It is convenient to condense all information regarding forecast performance into a single numerical value, or score, allowing competing forecast strategies to be objectively ranked and compared. For probabilistic forecasts, this can be achieved using scoring rules. Scoring rules are functions of the form

\begin{align*} S \colon \mathcal {M} \times \mathcal {X} \to \mathbb {R} \cup \{ -\infty, \infty \}, \end{align*}

where \( \mathcal{M}\) is a suitable class of probability measures over the measurable outcome space (\( \mathcal{X}, \mathcal{A}\)). We restrict our attention to negatively oriented scoring rules, for which a lower score indicates a more accurate forecast. The score assigned to a forecast can therefore be interpreted as a loss.

While scoring rules have been introduced as a means to evaluate probabilistic forecasts, they are also applicable in other contexts. For example, probabilistic models are regularly used to describe the uncertainties present when emulating complex physical processes, such as blood flow [33], tsunami wave height [3], or sea level rise due to ice sheet melting [7]. Although these models are often evaluated via associated point predictions, using the mean squared error, for example, the uncertainty estimates obtained from probabilistic models can be interpreted as probabilistic forecasts for the target variable. Scoring rules can thus be employed to assess whether or not these probabilistic models accurately characterize the uncertainty in this outcome.

It is widely accepted that scoring rules should be proper. Let \( S(P, Q) = \mathbb{E}_{Q}[S(P, Y)]\) denote the expectation of \( S(P, Y)\) when \( Y \sim Q\). A scoring rule \( S\) is proper with respect to \( \mathcal{M}^{\prime } \subset \mathcal{M}\) if

\begin{align} S(Q, Q) \leq S(P, Q) \quad \text{for all} \quad P, Q \in \mathcal{M}^{\prime }, \end{align}

(1.1)

where it is assumed that \( S(P, Q)\) exists for all \( P,Q \in \mathcal{M}^{\prime }\), and that \( S(Q, Q)\) is finite. That is, if the observations are believed to arise according to \( Q \in \mathcal{M}^{\prime }\), then the expected value of a proper score is minimized over \(\mathcal{M}^{\prime }\) by issuing \( Q\) as the forecast. If \( Q\) is the unique minimizer of the expected score \(P \in \mathcal{M}^{\prime } \mapsto S(P, Q)\), then the scoring rule is said to be strictly proper.

Proper scoring rules quantify the distance between a probabilistic forecast and the corresponding observation. In practice, certain outcomes typically have a higher impact on the forecast user than others, and it is therefore particularly valuable to evaluate how well forecasters can predict these high-impact events; improving the forecasts made for such outcomes may allow their impacts to be mitigated. However, evaluating forecasts for extreme events is an intrinsically challenging task [40, 5]. A proper scoring rule is generally rendered improper if it is used to evaluate only the predictions made when high-impact events occur, which can result in unreliable conclusions regarding the performance of competing forecasters [17]. This is often referred to as the forecaster’s dilemma [24]. To circumvent the forecaster’s dilemma, it has become common to employ weighted scoring rules, which can direct the evaluation of forecasts to certain outcomes in a theoretically sound way.

The concept of weighted scoring rules dates back at least to [28], though most developments in the field have occurred more recently [17, 11, 19]. For example, [17] introduces two weighted versions of the continuous ranked probability score (CRPS), while [11] proposes two weighted adaptations of the logarithmic score. In [19], the approach followed in [11] is generalized to a framework capable of constructing weighted versions of any proper scoring rule. To focus evaluation on particular outcomes, throughout this paper a weight function is a measurable function from the sample space \( \mathcal{X}\) to \( [0, 1]\).

Weighted scoring rules have been studied almost exclusively in the univariate setting, with interest often placed on extreme events, defined as instances where the outcome exceeds a chosen threshold. Often, however, an impact results from the interaction of several moderate events. A good example of this is a compound weather event, whereby multiple weather hazards combine and interact to generate a high-impact event, despite none of the contributing hazards themselves necessarily being extreme from a statistical perspective; see, e.g., [42]. Currently, there are no established methods for evaluating forecasts made for these high-impact events, despite their obvious significance to forecast users. The present article introduces weighted multivariate scoring rules, which allow emphasis to be placed on regions of a multidimensional outcome space when evaluating forecast accuracy. To achieve this, we utilize kernel scores, a general class of proper scoring rules based on conditionally negative definite (c.n.d.) kernels [16, 9].

Kernel scores are introduced in detail in the following section. This paper then draws connections between kernel scores and weighted scoring rules by identifying kernels that allow high-impact events to be targeted during forecast evaluation. Arguably the most well-known weighted scoring rule is the threshold-weighted continuous ranked probability score (twCRPS) introduced in [17]. In section 3, we demonstrate that the twCRPS is a kernel score. This permits a generalization of the twCRPS to so-called threshold-weighted kernel scores, which we introduce in section 4. Moreover, established results on kernels are leveraged in order to introduce further new approaches to weight kernel scores. These results widen the range of situations in which weighted scoring rules can be applied, and we illustrate this in section 5 by considering outcomes in multidimensional Euclidean space. We introduce weighted energy scores, weighted variogram scores, and a new scoring rule based on a bounded kernel. The utility of these weighted multivariate scoring rules when evaluating forecasts made for high-impact events is presented in a simulation study in section 5, as well as in a case study on flood forecasts in section 6. A discussion of the results is available in section 7, and all proofs are deferred to the appendices.

## 2. Kernel scores.

Kernel scores form a general class of proper scoring rules based on c.n.d. kernels [16, 9]. Here, a negative definite kernel is a symmetric function \( \rho : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) for which

\begin{align} \sum_{i=1}^{n}\sum_{j=1}^{n} c_{i} c_{j} \rho (x_{i}, x_{j}) \leq 0 \end{align}

(2.1)

for all \( n \in \mathbb{N}, x_{1}, \dots, x_{n} \in \mathcal{X},\) and \( c_{1}, \dots, c_{n} \in \mathbb{R}\). A kernel is c.n.d. if the above criterion is satisfied for all \( c_{1}, \dots, c_{n} \in \mathbb{R}\) that sum to zero, and is strictly negative definite if, for distinct \( x_{1}, \dots, x_{n}\), equality in (2.1) holds only when \( c_{1} = 0, \dots, c_{n} = 0\). Conversely, a kernel is said to be positive definite if the inequality in (2.1) is reversed. We assume wherever necessary that the kernels are measurable, and use \(\mathfrak{M}\) to denote the set of all probability measures on \((\mathcal{X}, \mathcal{A})\).
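To make the definition concrete, the inequality in (2.1) can be checked numerically for the kernel \( \rho (x, x^{\prime }) = |x - x^{\prime }|\) that underlies the CRPS. The following sketch (ours, not part of the formal development) draws random points and random coefficients that sum to zero, as required for conditional negative definiteness:

```python
import numpy as np

rng = np.random.default_rng(0)

def cnd_quadratic_form(rho, x, c):
    """Compute sum_{i,j} c_i c_j rho(x_i, x_j), the quadratic form in (2.1)."""
    K = rho(x[:, None], x[None, :])
    return float(c @ K @ c)

rho = lambda a, b: np.abs(a - b)  # c.n.d. kernel underlying the CRPS

for _ in range(100):
    x = rng.normal(size=8)
    c = rng.normal(size=8)
    c -= c.mean()  # coefficients summing to zero
    # the quadratic form is nonpositive (up to floating-point error)
    assert cnd_quadratic_form(rho, x, c) <= 1e-10
```

Repeating the check without centering the coefficients would fail for this kernel, since \(|x - x^{\prime }|\) is c.n.d. but not negative definite.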

Given a c.n.d. kernel \( \rho\), the associated kernel score is

\begin{align} S_{\rho }(P, y) = \mathbb{E}_{P} \left [ \rho (X, y) \right ] - \frac{1}{2} \mathbb{E}_{P} \left [ \rho (X, X^{\prime }) \right ] - \frac{1}{2} \rho (y, y), \end{align}

(2.2)

where \( X, X^{\prime } \sim P\) are independent. The final term in (2.2) does not depend on the forecast and is not present in previous definitions of kernel scores [14, 38]. It is included here since it generates a scoring rule that can be interpreted as a divergence between the forecast and a Dirac measure at the outcome, even if \(\rho (y, y) \neq 0\). Several familiar scoring rules fall into this kernel score framework, including the Brier score [6], the CRPS, and the energy score [16]. In section 5, we demonstrate that the variogram score proposed by [35] is also a kernel score.

A scoring rule \(S\) is proper with respect to \( \mathcal{M} \subseteq \mathfrak{M}\) if and only if its divergence function \( d(P, Q) = S(P, Q) - S(Q, Q)\) is nonnegative for all \( P,Q \in \mathcal{M}\). The divergence function corresponding to a kernel score \(S_{\rho }\) is

\begin{align} d_{\rho }(P, Q) = \mathbb{E}_{P,Q} \left [ \rho (X, Y) \right ] - \frac{1}{2} \mathbb{E}_{P} \left [ \rho (X, X^{\prime }) \right ] - \frac{1}{2} \mathbb{E}_{Q} \left [ \rho (Y, Y^{\prime }) \right ], \end{align}

(2.3)

where \( X, X^{\prime } \sim P\) and \( Y, Y^{\prime } \sim Q\) are independent. That is, the score divergence between \( P\) and \( Q\) is proportional (by a factor of one half) to the energy distance with respect to \( \rho\) [39]. Energy distances are special cases of squared maximum mean discrepancies (MMDs) [18], and kernel score divergences can be interpreted as squared MMDs under suitable integrability conditions [37]. This provides a natural connection between the optimum score estimation theory introduced by [16] and machine learning algorithms that use the MMD as a loss function; see, e.g., [12, 26]. The following theorem summarizes existing results on the propriety of kernel scores, and is obtained by merging results in [37] and [38]. Previously, a special case of this result was presented in [16].
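The divergence in (2.3) can be estimated from samples by replacing each expectation with an empirical mean. The following Monte Carlo sketch (ours, with \( \rho (x, x^{\prime }) = |x - x^{\prime }|\) and Gaussian samples as an assumed test case) illustrates that the estimated divergence is near zero for identical distributions and positive otherwise:

```python
import numpy as np

def kernel_score_divergence(rho, xs, ys):
    """Monte Carlo (V-statistic) estimate of d_rho(P, Q) in (2.3),
    from samples xs ~ P and ys ~ Q."""
    cross = rho(xs[:, None], ys[None, :]).mean()
    within_x = rho(xs[:, None], xs[None, :]).mean()
    within_y = rho(ys[:, None], ys[None, :]).mean()
    return cross - 0.5 * within_x - 0.5 * within_y

rho = lambda a, b: np.abs(a - b)
rng = np.random.default_rng(1)

# same distribution: divergence approximately zero
same = kernel_score_divergence(rho, rng.normal(size=4000), rng.normal(size=4000))
# shifted distribution: divergence clearly positive
diff = kernel_score_divergence(rho, rng.normal(size=4000),
                               rng.normal(2.0, 1.0, size=4000))
assert abs(same) < 0.1 and diff > 0.5
```

The V-statistic above retains the (zero) diagonal terms for simplicity; an unbiased U-statistic would exclude them.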

For a c.n.d. kernel \(\rho\), we will often state the assumption that \(S_\rho\) is proper with respect to \(\mathcal{M}_\rho\) or \(\mathcal{M}^\rho\). For concision, it is always assumed in the former case that \(\rho (x, x)=0\) for all \(x \in \mathcal{X}\), and in the latter case that \(\rho\) is negative definite.

The strict propriety of kernel scores relates to the injectivity of kernel mean embeddings [38]. If \(\rho = -k\), with \(k\) a positive definite and bounded kernel, then this is equivalent to \(k\) being characteristic [30]. On the other hand, if the c.n.d. kernel \(\rho\) is a metric, then the criterion that (2.3) is zero if and only if \( P=Q\) is exactly the definition given by [27] for \( \rho\) to be a metric of strong negative type.

## 3. Weighted versions of the CRPS.

### 3.1. Definitions and properties.

Let \( \mathcal{M}\) be the set of Borel probability measures on \( \mathcal{X} = \mathbb{R}\) with finite first moment, and identify elements of \( \mathcal{M}\) with their associated distribution functions. A popular scoring rule is the continuous ranked probability score (CRPS), defined as

\begin{align} \begin{split} \mathrm{CRPS}(F, y) & = \int_{\mathbb{R}} (F(z) - \unicode{x1D7D9}\{y\leq z\})^2 \: \mathrm{d} z \\ & = 2\int_{(0,1)} (\unicode{x1D7D9}\{F^{-1}(\alpha ) \geq y\} - \alpha )(F^{-1}(\alpha ) - y) \: \mathrm{d} \alpha \\ & = \mathbb{E}_{F}|X - y| - \frac{1}{2}\mathbb{E}_{F}|X - X^{\prime }|, \end{split} \end{align}

(3.1)

where \( X, X^{\prime } \sim F \in \mathcal{M}\) are independent, \( y \in \mathbb{R}\) is the observation, and \( \unicode{x1D7D9}\) denotes the indicator function. In the second expression, \( F^{-1}\) is the lower quantile function or generalized inverse of \( F\).
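The equivalence of the first and third representations in (3.1) can be checked numerically for an empirical distribution function. The following sketch (ours) computes the kernel score form for a small ensemble and cross-checks it against a direct discretization of the Brier score integral:

```python
import numpy as np

def crps_ensemble(x, y):
    """CRPS of an ensemble via the kernel score representation in (3.1):
    E|X - y| - 0.5 E|X - X'|, over the empirical distribution of x."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - y).mean() - 0.5 * np.abs(x[:, None] - x[None, :]).mean()

x, y = np.array([0.0, 1.0, 3.0]), 0.5

# direct discretization of the first integral in (3.1)
z = np.linspace(-10.0, 10.0, 400001)
F = (x[:, None] <= z[None, :]).mean(axis=0)
integral = ((F - (y <= z)) ** 2).sum() * (z[1] - z[0])

assert abs(crps_ensemble(x, y) - integral) < 1e-3
assert crps_ensemble([2.0, 2.0], 2.0) == 0.0  # perfect point forecast scores zero
```

For this ensemble, both computations give a CRPS of 0.5.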

The CRPS is strictly proper with respect to \( \mathcal{M}\) [16]. Its three different representations also partly explain the score’s popularity. The first expression demonstrates that the CRPS is equivalent to the Brier score integrated over all possible thresholds [28], whereas the second expression shows it can also be written as a quantile scoring rule integrated over all quantiles [23]. The final representation illustrates that the CRPS is a kernel score, where the c.n.d. kernel is \(\rho (x, x^{\prime }) = | x - x^{\prime } |\) [16].

Due to their popularity, weighted scoring rules have been studied in detail using the CRPS, with the most well-known version being the threshold-weighted continuous ranked probability score (twCRPS),

\begin{align} \mathrm{twCRPS}(F, y; \nu ) = \int_{\mathbb{R}} (F(z) - \unicode{x1D7D9}\{y\leq z\})^2 \: \mathrm{d} \nu (z), \end{align}

(3.2)

where \( \nu\) is a Borel measure on \( \mathbb{R}\), often chosen so that it has density equal to a particular nonnegative weight function, \( w\) [28, 17]. Although analytical expressions of the twCRPS have been derived for particular families of parametric distributions (see, e.g., [2]), the integral in (3.2) is often evaluated using numerical techniques. The following proposition provides an alternative representation of the twCRPS as a kernel score, implying a straightforward approach to computing this integral when \( F\) is an empirical distribution function.

Proposition 3.1 generalizes the three representations of the CRPS in (3.1), which correspond to the case where \( \nu\) is the Lebesgue measure and \( v(z) = z\) (up to a constant) for all \( z \in \mathbb{R}\). The final equality in Proposition 3.1 illustrates that the twCRPS can be interpreted as the CRPS after having deformed the forecasts and observations, where the deformation is governed by the choice of measure or weight function. We refer to \( v\) as the chaining function. If the measure \( \nu\) in the twCRPS is chosen so that it has density \( w\), then the chaining function \( v\) is any function such that

\begin{align*} v(x) - v(x^{\prime }) = \int_{[x^{\prime }, x)} w(z) \: \mathrm {d} z; \end{align*}

see [40].

Some common weight functions and their corresponding chaining functions are displayed in Figure 1, along with the resulting kernels to be employed in the twCRPS. Knowing the chaining function permits a greater appreciation of what the weight in the twCRPS achieves. For example, if weight is placed only on values above a certain threshold, \( w(z) = \unicode{x1D7D9}\{z \geq t\}\) for \(t \in \mathbb{R}\), then the forecasts and observations are projected onto \( [t, \infty )\), with values lower than the threshold mapped onto \( t\), before the unweighted CRPS is calculated. In section 4, we use the chaining function to clarify when the twCRPS is strictly proper.
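The weight–chaining pairs discussed above can be written down explicitly. The sketch below (the function names are ours) implements the chaining functions for the one-sided and interval indicator weights, and numerically verifies the defining relation between \(v\) and \(w\) for the one-sided case:

```python
import numpy as np

def v_above(z, t):
    """Chaining function for w(z) = 1{z >= t}: projects values onto [t, inf)."""
    return np.maximum(z, t)

def v_interval(z, a, b):
    """Chaining function for w(z) = 1{a <= z <= b}: projects values onto [a, b]."""
    return np.clip(z, a, b)

# check v(x) - v(x') = integral of w over [x', x) for the one-sided weight, t = 1
t, x_prime, x = 1.0, -0.5, 2.5
grid = np.linspace(x_prime, x, 300001)
w = (grid >= t).astype(float)
integral = w[:-1].sum() * (grid[1] - grid[0])  # left Riemann sum
assert abs((v_above(x, t) - v_above(x_prime, t)) - integral) < 1e-3
```

Both chaining functions leave values inside \(\{w > 0\}\) unchanged, so applying the unweighted CRPS after the transformation targets the region of interest.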

The kernel score representation of the twCRPS readily extends to the integrated quadratic distance (IQD), the score divergence associated with the CRPS [41]. A threshold-weighted version of the IQD is defined analogously to the twCRPS in (3.2), and there exists a similar representation of this weighted divergence in terms of the kernel \( \rho (x, x^{\prime }) = |v(x) - v(x^{\prime })|\).

The second equality in Proposition 3.1 demonstrates that, like the CRPS, the twCRPS can also be expressed as the integral of a quantile scoring function over all possible quantiles. An \(\alpha\)-quantile scoring function can be written in the form \((\unicode{x1D7D9}\{x \geq y\} - \alpha )(v(x) - v(y))\), with \( v\) increasing (see [13] for details). Proposition 3.1 shows that the integral of such quantile scoring functions over all possible values of \( \alpha\) results in a twCRPS. The quantile score representation of the CRPS has been used to introduce a quantile-weighted version of the CRPS [17], in which the weight function emphasizes particular regions of the forecast distribution rather than regions of the outcome space. Since the twCRPS can also be expressed as an integral over quantile scoring rules, a weight function could be introduced into this integral in an analogous way.

The threshold- and quantile-weighted versions of the CRPS were introduced to circumvent the fact that scaling a proper scoring rule using a weight governed by the outcomes results in an improper scoring rule [17]. Let \( S\) denote a scoring rule that is proper with respect to \( \mathcal{M}\), let \( w\) be a weight function (a measurable function with values in \([0, 1]\)), and let \( G \in \mathcal{M}\) with \( \mathbb{E}_{G}[w(X)] \gt 0\). The expectation \( \mathbb{E}_{G}[w(Y)S(F, Y)]\) is minimized by issuing a weighted version of \( G\),

\begin{align} G_{w}(x) = \frac{\mathbb{E}_{G}\left [ \unicode{x1D7D9}\{ X \leq x\} w(X) \right ] }{\mathbb{E}_{G}[w(X)]}, \end{align}

(3.4)

rather than \( G\) itself. If \( w\) is a weight function and \( S\) is a proper scoring rule with respect to \(\{ F_{w} | F \in \mathcal{M}, \mathbb{E}_{F}[w(X)] \gt 0\}\), with \( F_{w}\) defined analogously to \(G_{w}\) at (3.4), then

\begin{align} \mathrm{ow}S(F, y; w) = w(y) S(F_{w}, y) \end{align}

(3.5)

defines a scoring rule that is proper with respect to \( \{ F \in \mathcal{M} | \mathbb{E}_{F}[w(X)] \gt 0 \}\) [19]. Since the weighting in this case is directly dependent on the outcome \( y\), we refer to scoring rules in this form as outcome-weighted scoring rules. For example, the outcome-weighted CRPS is defined as

\begin{align*} \mathrm {owCRPS}(F, y; w) = w(y)\int_{\mathbb {R}} (F_{w}(z) - \unicode {x1D7D9}\{y \leq z\})^{2} \: \mathrm {d} z. \end{align*}
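For an ensemble forecast, the weighted distribution \(G_{w}\) in (3.4) amounts to reweighting the members by the weight function. A minimal sketch (ours), with an indicator weight as an assumed example:

```python
import numpy as np

def weighted_ecdf(x, w_vals):
    """Empirical version of G_w in (3.4): the ensemble members x are
    reweighted by w_vals and renormalized."""
    x_arr = np.asarray(x, dtype=float)
    w_arr = np.asarray(w_vals, dtype=float)

    def F_w(t):
        return float(((x_arr <= t) * w_arr).sum() / w_arr.sum())

    return F_w

x = [0.0, 1.0, 2.0, 3.0]
w = lambda z: float(z >= 1.5)  # indicator weight above a threshold
F_w = weighted_ecdf(x, [w(v) for v in x])

# members below the threshold receive zero weight
assert F_w(1.0) == 0.0 and F_w(3.0) == 1.0
```

Note that \(F_{w}\) is undefined when all members receive zero weight, a point that becomes relevant for the ensemble version of the owCRPS below.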

To appreciate how the outcome-weighted CRPS (owCRPS) differs from the threshold-weighted CRPS, consider a weight function of the form \( w(z) = \unicode{x1D7D9}\{z \geq t\}\). The twCRPS with this weight function assesses how well the forecast can identify whether the observation will exceed the threshold \( t\), or any value larger than \( t\). The owCRPS, on the other hand, is concerned with forecast performance only when the observation \( y\) exceeds this threshold, in which case the forecast is assessed via its conditional distribution given that the threshold has been exceeded. This difference is made more explicit by noting that the owCRPS can be expressed in the following form.
\begin{align} \mathrm{owCRPS}(F, y; w) = \frac{\mathbb{E}_{F}\left [ |X - y| w(X) \right ] w(y)}{\mathbb{E}_{F}[w(X)]} - \frac{\mathbb{E}_{F}\left [ |X - X^{\prime }| w(X) w(X^{\prime }) \right ] w(y)}{2 \, \mathbb{E}_{F}[w(X)]^{2}}, \end{align}

(3.6)

where \( X, X^{\prime } \sim F\) are independent.

This expression is given in [19] for weight functions of the form \( w(z) = \unicode{x1D7D9}\{z \geq t\}\). This representation of the outcome-weighted CRPS demonstrates that the weight function is applied to the output of the kernel, in contrast to the threshold-weighted version of the CRPS, which involves a prior transformation of the forecasts and outcome.

Unlike the twCRPS, the owCRPS is not a kernel score. Consider a forecast that is a Dirac measure at \( z \in \mathbb{R}\), \( P = \delta_{z}\). From (2.2), it is straightforward to verify that a kernel score \( S_{\rho }\) satisfies \( S_{\rho }(\delta_{z}, y) = \rho (z, y) - \frac{1}{2}\rho (z, z) - \frac{1}{2}\rho (y, y) = S_{\rho }(\delta_{y}, z)\). However, assuming \( w(y), w(z) \gt 0\), we have \( \mathrm{owCRPS}(\delta_{z}, y; w) = |z - y|w(y)\), which is, in general, not equal to \( \mathrm{owCRPS}(\delta_{y}, z; w) = |z - y|w(z)\).
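The asymmetry argument above is easy to verify numerically. A sketch (ours), using a strictly positive weight function chosen for illustration:

```python
def ow_crps_dirac(z, y, w):
    """owCRPS for a Dirac forecast at z (assumes w(z) > 0): |z - y| * w(y)."""
    return abs(z - y) * w(y)

# a strictly positive weight function (our illustrative choice)
w = lambda u: 1.0 if u >= 1.0 else 0.5

z, y = 2.0, 0.0
# swapping forecast point and outcome changes the score, so the owCRPS
# cannot satisfy the symmetry that every kernel score must satisfy
assert ow_crps_dirac(z, y, w) != ow_crps_dirac(y, z, w)
```

Here the two scores are \(|z - y| w(y) = 1.0\) and \(|z - y| w(z) = 2.0\), confirming the asymmetry.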

### 3.2. Localizing scores.

Outcome-weighted scoring rules exhibit some desirable properties when we are only interested in a particular subset of outcomes. We recall the definitions of localizing and strictly or proportionally locally proper scoring rules given in [19].

Let \(\mathcal{M}\) be a class of probability measures on a measurable space \((\mathcal{X}, \mathcal{A})\), let \( S\) be a scoring rule, and let \( w\) be a weight function. The scoring rule \(S\) is called

*localizing with respect to* \( w\) if, for any \( P, Q \in \mathcal{M}\), \( P(\cdot \cap \{ w \gt 0\} ) = Q( \cdot \cap \{w \gt 0\} )\) implies that \( S(P, y) = S(Q, y)\) for all \( y \in \mathcal{X}\). Here, \( \{w \gt 0\} = \{x \in \mathcal{X} | w(x) \gt 0\}\). That is, a weighted scoring rule is localizing if it depends only on the forecast measure restricted to outcomes for which the weight function is positive. Clearly, however, if \( \{ w=0 \} = \{x \in \mathcal{X} | w(x)=0\}\) is nonempty, then such a scoring rule typically will not be strictly proper.

Let \(S\) be proper with respect to \( \mathcal{M}\) and localizing with respect to \(w\). Then, \(S\) is called

*strictly locally proper with respect to \(w\) and* \( \mathcal{M}\) if \( S(P, Q) = S(Q, Q)\) implies \( P(\cdot \cap \{ w \gt 0\} ) = Q( \cdot \cap \{w \gt 0\} )\) for any \( P, Q \in \mathcal{M}\). The scoring rule \(S\) is called *proportionally locally proper with respect to* \(w\) *and* \( \mathcal{M}\) if \( S(P, Q) = S(Q, Q)\) holds if and only if \( P(\cdot \cap \{ w \gt 0\} ) = cQ( \cdot \cap \{w \gt 0\} )\) for any \(P, Q \in \mathcal{M}\) and some constant \( c \gt 0\) that may depend on \( P\) and \( Q\).

If \( S\) is proper with respect to \(\{ P_{w} | P \in \mathcal{M}, \mathbb{E}_{P}[w(X)] \gt 0\}\), the outcome-weighted version of this score (constructed via (3.5)) will be proper and localizing with respect to \( \{ P \in \mathcal{M} | \mathbb{E}_{P}[w(X)] \gt 0\}\), and if \( S\) is strictly proper, then the outcome-weighted score will additionally be proportionally locally proper [19, Theorem 3]. In order to obtain a strictly locally proper weighted scoring rule, an outcome-weighted score can be complemented with a strictly proper scoring rule \( S_{0}\) for probability forecasts of binary events, such as the logarithmic score or the Brier score,

\begin{align} \tilde{S}(P, y) = \mathrm{ow}S(P, y; w) + \{ w(y) S_{0}(C_{w}(P), 1) + (1 - w(y)) S_{0}(C_{w}(P), 0) \}, \end{align}

(3.7)

where \( C_{w}(P)\) is defined as in Proposition 3.2 and is interpreted here as a probability forecast. If ow\(S\) is proportionally locally proper, then \( \tilde{S}\) will be strictly locally proper [19, Theorem 3]. To understand why, note that proportionally locally proper scores evaluate only the shape of the restriction of \( P\) to \( \{ w \gt 0 \}\), whereas the binary score included in (3.7) allows \( \tilde{S}\) to additionally assess the measure that is assigned to the sets \( \{ w=0 \}\) and \( \{ w \gt 0 \}\).

While outcome-weighted scoring rules are localizing by construction, the twCRPS is only localizing with respect to particular weight functions. The twCRPS is not localizing with respect to a weight that is an indicator function of a compact interval over the real line, \( w(z) = \unicode{x1D7D9}\{a \leq z \leq b\}\) for \(a,b \in \mathbb{R}\), whereas it is localizing if the weight is the indicator function of a one-sided interval, e.g., \( w(z) = \unicode{x1D7D9}\{z \geq t\}\). Figure 1 helps to appreciate why this is the case. If the weight is one-sided, then the chaining function maps all elements in \( \{ w=0 \}\) to a single point. On the other hand, if the weight is an indicator of a compact interval, then points in \( \{ w=0 \}\) will be mapped to one of two values, depending on where they lie in relation to the interval. As a result, the twCRPS depends not only on the forecast distribution within the region of interest, but also on the measure assigned to outcomes above and below the interval bounds. If the weight is an indicator function of a one-sided interval, the twCRPS is localizing and strictly locally proper [19, Theorem 4].

### 3.3. Weighted scores for ensemble forecasts.

Equations (3.3) and (3.6) allow for straightforward computation of the weighted scoring rules when forecasts are in the form of an empirical distribution function, or an ensemble, that is, \( F_{ens}(x) = \sum_{m=1}^{M} \unicode{x1D7D9}\{x_{m} \leq x \}/ M\), for \( x_{1}, \dots, x_{M} \in \mathbb{R}\). As remarked in [14], the kernel score representation of the CRPS makes it “particularly convenient when \( F\) is represented by a sample, possibly based on Markov chain Monte Carlo output or forecast ensembles,” which is ordinarily the case in weather and climate forecasting; see, e.g., [25]. Using the kernel score representation of the twCRPS at (3.3), we obtain

\begin{align} \textrm{twCRPS}(F_{ens}, y; v) = \frac{1}{M}\sum_{m=1}^{M} |v(x_{m}) - v(y)| - \frac{1}{2M^{2}}\sum_{m=1}^{M}\sum_{j=1}^{M} |v(x_{m}) - v(x_{j})|. \end{align}

(3.8)

Substituting the identity for \(v\) in (3.8) recovers the corresponding well-known expression for the CRPS as a special case. From an implementation standpoint, this expression has the benefit of involving only a (typically straightforward) transformation of the ensemble members and observations before applying existing methods and software to calculate the CRPS.
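Equation (3.8) translates directly into code. The following sketch (ours) transforms the ensemble members and the observation with the chaining function and then applies the ensemble CRPS formula, using the one-sided chaining function \(v(z) = \max (z, t)\) as an assumed example:

```python
import numpy as np

def tw_crps_ensemble(x, y, v):
    """Threshold-weighted CRPS of an ensemble via (3.8): apply the chaining
    function v to members and observation, then compute the ensemble CRPS."""
    vx, vy = v(np.asarray(x, dtype=float)), v(y)
    return np.abs(vx - vy).mean() - 0.5 * np.abs(vx[:, None] - vx[None, :]).mean()

t = 1.0
v = lambda z: np.maximum(z, t)  # chaining function for w(z) = 1{z >= t}
x = np.array([-1.0, 0.5, 2.0, 3.0])

# observations below the threshold are mapped to t, so they are treated alike
assert tw_crps_ensemble(x, 0.0, v) == tw_crps_ensemble(x, -5.0, v)
```

The same routine recovers the unweighted ensemble CRPS by passing the identity for `v`.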

Similarly, (3.6) permits the following representation of the outcome-weighted CRPS:

\begin{align*} \mathrm {owCRPS}(F_{ens}, y; w) &= \frac {1}{M\bar {w}}\sum_{m=1}^{M}| x_{m} - y| w(x_{m}) w(y) \\&\quad - \frac {1}{2M^{2}\bar {w}^{2}}\sum_{m=1}^{M}\sum_{j=1}^{M} | x_{m} - x_{j} | w(x_{m}) w(x_{j}) w(y), \end{align*}

where \( \bar{w} = \sum_{m=1}^{M} w(x_{m})/M\). This expression can result in an undefined score if the weight function is not strictly positive, for example, if the weight \( w(z) = \unicode{x1D7D9}\{z \geq t\}\) is employed but all ensemble members fall below the threshold \( t\). This is more generally the case for continuous forecast distributions that assign zero probability to \( \{ w \gt 0 \}\), meaning \( F_{w}\) is undefined, though this issue will be more prevalent when dealing with finite ensembles.
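The ensemble expression above, including the undefined case when no member receives positive weight, can be sketched as follows (our implementation, returning `nan` in the undefined case):

```python
import numpy as np

def ow_crps_ensemble(x, y, w):
    """Outcome-weighted CRPS of an ensemble; returns np.nan when all members
    receive zero weight, in which case the score is undefined."""
    x = np.asarray(x, dtype=float)
    wx, wy = w(x), w(y)
    w_bar = wx.mean()
    if w_bar == 0.0:
        return np.nan  # F_w is undefined
    M = len(x)
    term1 = (np.abs(x - y) * wx).sum() * wy / (M * w_bar)
    term2 = (np.abs(x[:, None] - x[None, :])
             * wx[:, None] * wx[None, :]).sum() * wy / (2 * M**2 * w_bar**2)
    return term1 - term2

w = lambda z: (np.asarray(z) >= 1.0).astype(float)

# observation below the threshold: w(y) = 0 makes the score vanish
assert ow_crps_ensemble([0.5, 1.5, 2.5], 0.0, w) == 0.0
# all members below the threshold: the score is undefined
assert np.isnan(ow_crps_ensemble([0.0, 0.5], 2.0, w))
```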

## 4. Making new kernel scores from old.

The previous section places existing weighted versions of the CRPS into the framework of kernel scores. Several operations exist under which the positive definiteness of a kernel is conserved [32], and this desirable property has also been studied in the case of negative definite and c.n.d. kernels; see, e.g., [4, 8]. In this section, we describe three operations on kernels that can be used to emphasize particular outcomes within the kernel score framework, and we clarify when the resulting weighted scoring rules are proper, localizing, and strictly or proportionally locally proper.

First, the twCRPS can be generalized to construct a class of threshold-weighted kernel scores by replacing the Euclidean distance with an arbitrary c.n.d. kernel.

As in the previous section, we refer to \(v\) as the chaining function.

Theorem 2.2 implies that if \(\rho\) is c.n.d. with \( \rho (x, x)=0\) for all \( x \in \mathcal{X}\), then the threshold-weighted kernel score with kernel \( \rho\) and chaining function \( v\) is proper with respect to \(\mathcal{M}_{\tilde{\rho }}\), where \( \tilde{\rho }(x, x^{\prime }) = \rho (v(x), v(x^{\prime }))\). If \(\rho\) is negative definite, then the score is proper with respect to \(\mathcal{M}^{\tilde{\rho }}\). If the kernel score \( S_{\rho }\) is strictly proper, then it is possible to characterize the chaining functions that preserve this strict propriety.

It could be argued that the strict propriety of a weighted scoring rule is not of primary concern, since our interest typically lies only in a subset of possible outcomes, namely the set \( \{ w \gt 0 \}\) determined by the chosen weight function. However, threshold-weighted kernel scores require the specification of a chaining function, which may or may not be associated with a weight function or measure, and there is no canonical way to derive a chaining function that corresponds directly to a given weight. Nevertheless, if a certain weight function has been chosen, it is possible to characterize the chaining functions for which a threshold-weighted kernel score is localizing and strictly locally proper.

Hence, whether or not a threshold-weighted kernel score is localizing with respect to a given weight will depend on the choice of chaining function.

The requirement \( \rho (v(z), v(\cdot )) = \rho (v(z^{\prime }), v(\cdot ))\) for all \( z, z^{\prime } \in \{w=0\}\) can easily be satisfied by choosing a chaining function such that \( v(z) = v(z^{\prime }) = x_{0}\) for all \( z, z^{\prime } \in \{w=0\}\) and some \(x_0 \in \mathcal{X}\). For such a chaining function, we say that the threshold-weighted kernel score is centered at \( x_{0}\). Although it is not obvious how the choice of \( x_{0}\) will affect the score’s behavior in practice, it does not alter its theoretical properties, and for the applications in section 5 there is always a canonical choice of \(x_{0}\).

In contrast to threshold-weighted kernel scores, outcome-weighted scores can be generalized to arbitrary c.n.d. kernels while maintaining a direct connection to the weight function. In particular, similarly to the outcome-weighted CRPS in (3.6), we define outcome-weighted kernel scores; they are a special case of the weighted scores proposed in [19].

As mentioned for the owCRPS, outcome-weighted kernel scores provide a means of weighting existing kernel scores, but they themselves do not fit into the kernel score framework. The results of [19] discussed in subsection 3.2 clarify when outcome-weighted kernel scores are proper, localizing, and proportionally locally proper, and how they can be modified to be strictly locally proper.

Finally, it is well known that if \( \rho\) is a negative definite kernel and \( w\) is a weight function, then \( \check{\rho }(x, x^{\prime }) = \rho (x, x^{\prime })w(x)w(x^{\prime })\) is also a negative definite kernel. We therefore propose constructing weighted scoring rules based on this weighted kernel, and we label such scores as vertically rescaled kernel scores.

Since \( \check{\rho }\) is itself a negative definite kernel and \( \mathcal{M}^{\rho } \subset \mathcal{M}^{\check{\rho }}\), it follows that the associated kernel score in (4.3) is proper with respect to \( \mathcal{M}^{\rho }\). Although multiplication by a nonnegative function preserves the negative definiteness of a kernel, if \( \rho\) is only c.n.d., it is not necessarily the case that \( \check{\rho }\) will be. However, if \( \rho\) is a c.n.d. kernel with \( \rho (x, x)=0\) for all \( x \in \mathcal{X}\), then

\begin{align} \rho^{*}(x, x^{\prime }) = \rho (x, x^{\prime }) - \rho (x, x_{0}) - \rho (x^{\prime }, x_{0}) \end{align}

(4.5)

will be negative definite for arbitrary \( x_{0} \in \mathcal{X}\) [4, Lemma 2.1]. Since \( \rho^{*}\) is negative definite, it follows that \( \rho^{*}(x, x^{\prime })w(x)w(x^{\prime })\) is also negative definite, and this kernel is used in (4.4) to construct the vertically rescaled kernel score associated with \(\rho\), with center \( x_{0}\). Proposition 20 in [37] can be used to show that this score is proper with respect to \(\mathcal{M}_\rho\).
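For ensemble forecasts, a vertically rescaled kernel score can be computed by weighting the kernel's output, in contrast to the input transformation used by threshold-weighted scores. The following is a sketch only (the paper's full definition with center \(x_{0}\) is not reproduced here): it plugs the weighted kernel \( \rho^{*}(x, x^{\prime })w(x)w(x^{\prime })\), with \( \rho^{*}\) from (4.5) built from \( \rho (x, x^{\prime }) = |x - x^{\prime }|\) and center \(x_{0} = t\), into the kernel score form (2.2):

```python
import numpy as np

def vr_kernel_score(rho, x, y, w):
    """Sketch of a vertically rescaled kernel score for an ensemble forecast:
    the kernel score (2.2) with the weighted kernel rho(x, x') w(x) w(x')."""
    x = np.asarray(x, dtype=float)
    wx, wy = w(x), w(y)
    cross = (rho(x, y) * wx).mean() * wy
    within = (rho(x[:, None], x[None, :]) * wx[:, None] * wx[None, :]).mean()
    return cross - 0.5 * within - 0.5 * rho(y, y) * wy * wy

t = 1.0
w = lambda z: (np.asarray(z) >= t).astype(float)
# rho* from (4.5) with the kernel |x - x'| and center x0 = t
rho_star = lambda a, b: np.abs(a - b) - np.abs(a - t) - np.abs(b - t)

# the score is localizing: members with zero weight do not affect it
s1 = vr_kernel_score(rho_star, [-1.0, 2.0, 3.0], 2.5, w)
s2 = vr_kernel_score(rho_star, [-100.0, 2.0, 3.0], 2.5, w)
assert s1 == s2
```

The final assertion illustrates the localizing property stated below: the score depends on the forecast only through its restriction to \( \{ w \gt 0 \}\).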

Regardless of whether \( \rho\) is negative definite, the unweighted kernel score is recovered by choosing \( w(z)=1\) and does not depend on \( x_{0}\). In general, however, the vertically rescaled kernel score will depend on \( x_{0}\). If a vertically rescaled kernel score is strictly proper with respect to a class of distributions that contains Dirac measures, then \(\{w=0\}\) can contain at most one element. Ensuring strict propriety of a vertically rescaled score requires stronger assumptions.

If \(-\rho\) is a bounded continuous strictly positive definite function on \(\mathbb{R}^d\) for some \(d \geq 1\), then the requirement in Proposition 4.8 is satisfied. However, if \(\rho\) is a c.n.d. kernel with \(\rho (x,x)=0\) for all \(x \in \mathcal{X}\), it is unclear whether Proposition 4.8 can be applied to \(\rho^*\) in (4.5), given the limited literature on kernel embeddings with unbounded kernels.

It can be seen immediately from Definition 4.7 that vertically rescaled kernel scores depend on the forecast \( P\) only through its restriction to the set \( \{ w \gt 0 \}\), and these scores are therefore localizing. Slightly generalizing Proposition 4.8, we obtain the following result.

Vertically rescaled kernel scores fit into the kernel score framework. While threshold-weighted kernel scores deform the inputs of the kernel, vertically rescaled kernel scores weight the kernel’s output. However, for particular weight and chaining functions, vertically rescaled and threshold-weighted kernel scores are equivalent.

The three approaches discussed in this section for constructing weighted scoring rules apply to the entire class of kernel scores and are therefore applicable in a wide range of settings. As an example of this, in the following section, we investigate the application of kernel scores in a multivariate context and use results from this section to introduce weighted versions of popular multivariate scoring rules.

## 5. Weighted multivariate scoring rules.

### 5.1. Energy and variogram scores.

Let \(\mathcal{X} = \mathbb{R}^{d},\) and let \( \mathcal{M}\) denote the set of Borel probability measures on \( \mathbb{R}^{d}\). Forecast verification in a multivariate setting is significantly less developed than in the univariate case [15]. In particular, there are relatively few recognized scoring rules to quantify the accuracy of multivariate forecasts. The logarithmic score can be used to assess multivariate predictive densities, but multivariate forecasts commonly appear in the form of a finite ensemble. Hence, the logarithmic score, as well as weighted versions of this score, such as the conditional and censored likelihood scores proposed by [11], are often not applicable in practice. Instead, two alternative multivariate scoring rules are commonly preferred: the energy score and the variogram score.

The energy score is defined as

\begin{align} \textrm{ES}_{\beta }(P, y) = \mathbb{E}_{P}||X - y||^{\beta } - \frac{1}{2}\mathbb{E}_{P}||X - X^{\prime }||^{\beta }, \end{align}

(5.1)

where \( || \cdot ||\) is the Euclidean norm, \( X, X^{\prime } \sim P\) are independent, and the exponent \( \beta \in (0, 2)\) is typically set to one [16]. Here and throughout this section, we assume that all relevant expectations are finite. Clearly, (5.1) defines a kernel score associated with the c.n.d. kernel \( \rho (x, x^{\prime }) = ||x - x^{\prime }||^{\beta }\), for \( x, x^{\prime } \in \mathbb{R}^{d}\), and the energy score thus generalizes the CRPS to multiple dimensions. The energy score is strictly proper with respect to \( \{ P \in \mathcal{M} : \mathbb{E}_{P}||X||^{\beta } \lt \infty \}\).

The results presented in the previous section allow us to generate three distinct weighted energy scores. First, Definition 4.1 can be used to construct a threshold-weighted energy score,

\begin{align*} \textrm {twES}_{\beta }(P, y; v) = \mathbb {E}_{P}||v(X) - v(y)||^{\beta } - \frac {1}{2}\mathbb {E}_{P}||v(X) - v(X^{\prime })||^{\beta }, \end{align*}

where \( v: \mathbb{R}^{d} \to \mathbb{R}^{d}\) is a chaining function. Alternatively, applying (4.2) to the energy score recovers the outcome-weighted energy score proposed in [19],

\begin{align*} \textrm {owES}_{\beta }(P, y; w) = \frac {1}{C_{w}(P)}\mathbb {E}_{P}[||X - y||^{\beta }w(X)w(y)] - \frac {1}{2C_{w}(P)^{2}}\mathbb {E}_{P}[||X - X^{\prime }||^{\beta }w(X)w(X^{\prime })w(y)], \end{align*}

where \( w:\mathbb{R}^{d} \to [0,1]\) is a weight function. Similarly, Definition 4.7 can be used to introduce a vertically rescaled energy score. Since the kernel used in the energy score is only c.n.d., this requires choosing a point \( x_{0} \in \mathbb{R}^{d}\) at which to center the weighted score. The natural choice is \( x_{0} = 0\):

\begin{align*} \begin{split} \textrm{vrES}_{\beta }(P, y; w) = &\,\, \mathbb{E}_{P}\left [ ||X - y||^{\beta } w(X)w(y) \right ] - \frac{1}{2}\mathbb{E}_{P} \left [ ||X - X^{\prime }||^{\beta } w(X)w(X^{\prime }) \right ] \\ & + (\mathbb{E}_{P} \left [ ||X - x_{0}||^{\beta } w(X) \right ] - ||y - x_{0}||^{\beta } w(y))(\mathbb{E}_{P} \left [ w(X) \right ] - w(y)). \end{split} \end{align*}

While the results in section 4 provide conditions on the weight and chaining functions that ensure strict (local) propriety of the threshold-weighted and outcome-weighted energy scores, we can only guarantee propriety for the vertically rescaled energy score.
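As a rough illustration, the unweighted, threshold-weighted, and vertically rescaled energy scores can be estimated from a finite ensemble. The sketch below uses hypothetical function names and the standard plug-in estimators (ensemble means replacing the expectations); it is not a particular implementation from the literature:

```python
import numpy as np

def energy_score(ens, y, beta=1.0):
    """Plug-in estimator of ES_beta(P, y) in (5.1).

    ens : (m, d) array of ensemble members; y : (d,) observation.
    """
    m = ens.shape[0]
    t1 = np.mean(np.linalg.norm(ens - y, axis=1) ** beta)
    # pairwise distances between ensemble members, for the second expectation
    diffs = np.linalg.norm(ens[:, None, :] - ens[None, :, :], axis=2) ** beta
    return t1 - diffs.sum() / (2 * m**2)

def tw_energy_score(ens, y, v, beta=1.0):
    """Threshold-weighted ES: apply the chaining function v before scoring."""
    return energy_score(np.apply_along_axis(v, 1, ens), v(y), beta)

def vr_energy_score(ens, y, w, x0=None, beta=1.0):
    """Vertically rescaled ES with weight w, centred at x0 (default: origin)."""
    m, d = ens.shape
    if x0 is None:
        x0 = np.zeros(d)
    wx = np.array([w(x) for x in ens])
    wy = w(y)
    t1 = np.mean(np.linalg.norm(ens - y, axis=1) ** beta * wx * wy)
    diffs = np.linalg.norm(ens[:, None, :] - ens[None, :, :], axis=2) ** beta
    t2 = (diffs * wx[:, None] * wx[None, :]).sum() / (2 * m**2)
    t3 = (np.mean(np.linalg.norm(ens - x0, axis=1) ** beta * wx)
          - np.linalg.norm(y - x0) ** beta * wy) * (wx.mean() - wy)
    return t1 - t2 + t3
```

With the identity chaining function, `tw_energy_score` reduces to `energy_score`, and with the constant weight \( w \equiv 1\), so does `vr_energy_score`, since the final term vanishes.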

Although the energy score is possibly the most widely implemented multivariate scoring rule, some studies suggest that it is oversensitive to marginal forecast performance; see, e.g., [31]. The variogram score has been introduced as an alternative multivariate scoring rule [35]. Given an observation \( y = (y_{1}, \dots, y_{d}) \in \mathbb{R}^{d}\), the variogram score of order \( p \gt 0\) is defined as

\begin{align} \textrm{VS}_{p}(P, y) = \sum_{i=1}^{d}\sum_{j=1}^{d} h_{i, j} (\mathbb{E}_{P}|X_{i} - X_{j}|^{p} - |y_{i} - y_{j}|^{p})^{2}, \end{align}

(5.2)

where \( X = (X_{1}, \dots, X_{d}) \sim P\), and \( h_{i, j} \geq 0\) are nonnegative scaling parameters. The scaling parameters control the emphasis given to pairs of dimensions, rather than to particular outcomes, and hence do not immediately provide a way to emphasize particular outcomes during forecast evaluation.
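A plug-in estimator of (5.2) from an ensemble is straightforward. The sketch below uses a hypothetical function name and defaults to \( h_{i,j} = 1\):

```python
import numpy as np

def variogram_score(ens, y, p=0.5, h=None):
    """Plug-in estimator of VS_p(P, y) in (5.2).

    ens : (m, d) array of ensemble members; y : (d,) observation;
    h : (d, d) array of nonnegative scaling parameters (default: all ones).
    """
    m, d = ens.shape
    if h is None:
        h = np.ones((d, d))
    # E_P |X_i - X_j|^p estimated by the ensemble mean, for each pair (i, j)
    ev = np.mean(np.abs(ens[:, :, None] - ens[:, None, :]) ** p, axis=0)
    # observed variogram |y_i - y_j|^p
    oy = np.abs(y[:, None] - y[None, :]) ** p
    return float(np.sum(h * (ev - oy) ** 2))
```

A degenerate ensemble whose members all equal the observation receives a score of exactly zero, and the score is nonnegative by construction.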

In contrast to the energy score, the variogram score is proper with respect to the set \( \{ P \in \mathcal{M} : \mathbb{E}_{P}|X_{i}|^{2p} \lt \infty \text{ for each } i=1, \dots, d \}\) but is not strictly proper [35]. The variogram score is also a kernel score, corresponding to the c.n.d. kernel

\begin{align*} \rho (x, x^{\prime }) = \sum_{i=1}^{d} \sum_{j=1}^{d} h_{i,j} (|x_{i} - x_{j}|^{p} - |x_{i}^{\prime } - x_{j}^{\prime }|^{p})^{2}, \end{align*}

where \( x = (x_{1}, \dots, x_{d})\), \(x^{\prime } = (x_{1}^{\prime }, \dots, x_{d}^{\prime }) \in \mathbb{R}^{d}\). The variogram score was introduced as a multivariate scoring rule that is more sensitive than the energy score to errors in the forecast’s dependence structure; it is thus itself an example of how the kernel within the kernel score framework can be chosen to emphasize particular features of the forecasts.

In addition, just as we have introduced weighted versions of the energy score, weighted variogram scores can also easily be introduced. For example, the threshold-weighted variogram score of order \( p\) with chaining function \( v\) is

\begin{align*} \textrm {twVS}_{p}(P, y; v) = \sum_{i=1}^{d}\sum_{j=1}^{d} h_{i, j} (\mathbb {E}_{P}|v(X)_{i} - v(X)_{j}|^{p} - |v(y)_{i} - v(y)_{j}|^{p})^{2}. \end{align*}

Since the variogram score is not strictly proper, the outcome-weighted version of this score is localizing and proper but not necessarily proportionally locally proper.

The energy score and variogram score are established kernel scores. The kernel score framework also permits the introduction of novel kernel scores by choosing an appropriate kernel. We illustrate this here by introducing a scoring rule based on the inverse multiquadric kernel [29, 36]. We define the inverse multiquadric score (IMS) as

\begin{align*} \mathrm {IMS}(P, y) = \mathbb {E}_{P} \left [ -(1 + ||X - y||^{2})^{-\frac {1}{2}} \right ] - \frac {1}{2} \mathbb {E}_{P} \left [ -(1 + ||X - X^{\prime }||^{2})^{-\frac {1}{2}} \right ] + \frac {1}{2}, \end{align*}

where \( X, X^{\prime } \sim P\) are independent.

The IMS can be used to assess both univariate and multivariate forecasts, and weighted versions of this score can be constructed using the results presented in the previous section. The strict propriety of the IMS with respect to the entire class \(\mathcal{M}\) follows from the fact that the inverse multiquadric kernel is strictly positive definite and bounded. The boundedness of the kernel contrasts with the unbounded kernels underlying the CRPS, energy score, and variogram score; in particular, it implies strict local propriety of the vertically rescaled IMS. By comparing the IMS to these established scores, we can see to what extent these properties of the kernel affect the behavior of the resulting kernel score.
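An ensemble plug-in estimator of the IMS can be sketched as follows (the function name is ours). The additive constant \( \frac{1}{2}\) is \( -\frac{1}{2}\rho (y, y)\), so a degenerate ensemble equal to the observation receives a score of exactly zero:

```python
import numpy as np

def inverse_multiquadric_score(ens, y):
    """Plug-in estimator of the IMS, based on the inverse multiquadric kernel.

    ens : (m,) or (m, d) array of ensemble members; y : scalar or (d,) observation.
    """
    ens = np.asarray(ens, dtype=float)
    if ens.ndim == 1:                      # univariate ensembles become (m, 1)
        ens = ens[:, None]
    y = np.atleast_1d(np.asarray(y, dtype=float))

    def k(a, b):
        # negated inverse multiquadric kernel, as in the score's definition
        return -(1.0 + np.sum((a - b) ** 2, axis=-1)) ** -0.5

    t1 = np.mean(k(ens, y))
    t2 = 0.5 * np.mean(k(ens[:, None, :], ens[None, :, :]))
    return t1 - t2 + 0.5
```

Since the kernel is bounded and strictly positive definite, the estimator is nonnegative and vanishes only when the empirical forecast places all mass on the observation.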

### 5.2. Weight and chaining functions.

We have argued that applying a weighted scoring rule is often synonymous with choosing a suitable kernel to employ within the kernel score framework. A natural question then arises regarding what weight or chaining function, and hence what kernel, to choose for a given problem. In this section, we consider the case where the outcome space is the \( d\)-dimensional Euclidean space, \( \mathbb{R}^{d}\), and discuss possible weight and chaining functions that could be used within the weighted multivariate scores introduced above in order to evaluate forecasts made for high-impact events.

In the univariate setting, it is common to assess forecasts with a weight function that only emphasizes values above (or below) a chosen threshold, e.g., \( w(z) = \unicode{x1D7D9}\{z \geq t\}\). In the multivariate case, this can be extended by considering weight functions that are one when a combination of the values along the different dimensions exceeds a threshold and are zero otherwise: \( w(z) = \unicode{x1D7D9}\{\sum_{i=1}^{d} b_{i} z_{i} \geq t\}\) for constants \( b_{1}, \dots, b_{d} \in \mathbb{R}\) and a threshold \( t \in \mathbb{R}\); see Figure 2a.

However, it may be the case that high-impact events arise from the interaction of several moderate events. For example, moderate rainfall over consecutive days is likely to result in flooding, despite the rainfall on each day not being extreme from a statistical perspective. Hence, one could also consider using a weight function that depends on whether a threshold is exceeded in every dimension, e.g., \( w(z) = \unicode{x1D7D9}\{z_{1} \geq t_{1}, \dots, z_{d} \geq t_{d}\}\) with \( t_{1}, \dots, t_{d} \in \mathbb{R}\). Such weight functions can be interpreted as indicator functions of orthants in Euclidean space; see Figure 2b.

The outcome-weighted and vertically rescaled scoring rules constructed using these weights based on indicator functions will not be strictly proper. The right-hand column of Figure 1 provides an example of a univariate weight function that is strictly positive, the Gaussian distribution function. In higher dimensions, a continuous multivariate distribution function can similarly be used; see Figure 2c.

If a threshold-weighted kernel score is to be implemented, it is necessary to specify a chaining function, rather than a weight, which is a less trivial task in the multivariate case. Proposition 4.4 states that the threshold-weighted kernel score will be localizing if the chaining function maps all points in \( \{ w=0 \}\) to a single value \( x_{0} \in \mathbb{R}^{d}\), while Proposition 4.5 states that if the chaining function is additionally injective on \( \{ w \gt 0\}\), then the score will be strictly locally proper. One option is to employ the function

\begin{align} v(z) = \begin{cases} z \text{ if } z \in \{ w \gt 0\}, \\ x_{0} \text{ if } z \in \{ w=0\} \end{cases} \end{align}

(5.3)

for some \( x_0 \in \mathbb{R}^d\). Such a chaining function maintains correspondence with the twCRPS when the weight is a one-sided interval; for example, when \( d=1\) and \( w(z) = \unicode{x1D7D9}\{z \geq t\}\), choosing \( x_{0} = t\) recovers the chaining function \( v(z) = \textrm{max}(z, t)\) presented in Figure 1.
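Equation (5.3) translates directly into code. A minimal sketch, with a helper name of our choosing:

```python
import numpy as np

def chaining_from_weight(w, x0):
    """Chaining function (5.3): identity on {w > 0}, constant x0 on {w = 0}.

    `w` is any weight function and `x0` the chosen centre; this helper is
    illustrative, not an implementation from the paper.
    """
    x0 = np.asarray(x0, dtype=float)

    def v(z):
        z = np.asarray(z, dtype=float)
        return z if w(z) > 0 else x0

    return v
```

For \( d=1\), \( w(z) = \unicode{x1D7D9}\{z \geq t\}\), and \( x_{0} = t\), this reproduces \( v(z) = \max(z, t)\).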

Of course, if the weight is not constant on \( \{ w \gt 0 \}\), then the chaining function in (5.3) will not be appropriate. The chaining function to choose in this case will depend strongly on what the weighting is designed to achieve. Since we are interested here in high-impact events, consider the case where the weight function is increasing along each dimension, for example, a multivariate distribution function. In the univariate case, if the weight is increasing, then the chaining function obtained by integrating the weight is convex. One way to translate this into the multivariate setting is to use a chaining function that is convex along every dimension.

A possible chaining function that satisfies this is the integral of the weight function along each margin separately, conditional on the values along the other dimensions. For example, the final weight function considered in Figure 2 employs a multivariate Gaussian distribution function. Given a mean vector \( \mu = (\mu_{1}, \dots, \mu_{d})\) and a diagonal covariance matrix \( \Sigma\) with variances \( \sigma_{1}^{2}, \dots, \sigma_{d}^{2}\), integrating the conditional Gaussian distribution function along each dimension yields a chaining function of the form

\begin{align*} v(z) &= \left ( (z_{1} - \mu_{1})\Phi \left (\frac{z_{1} - \mu_{1}}{\sigma_{1}}\right ) + \sigma_{1}\phi \left (\frac{z_{1} - \mu_{1}}{\sigma_{1}}\right ), \dots, (z_{d} - \mu_{d})\Phi \left (\frac{z_{d} - \mu_{d}}{\sigma_{d}}\right ) \right. \\ & \quad\left. +\kern1pt \sigma_{d}\phi \left (\frac{z_{d} - \mu_{d}}{\sigma_{d}}\right ) \right ), \end{align*}

where \( \Phi\) and \( \phi\) denote the standard Gaussian distribution and density functions. This is essentially a componentwise extension of the chaining function presented in the final column of Figure 1. Moreover, since this chaining function is injective on \( \{w \gt 0\}\), Proposition 4.5 states that the resulting threshold-weighted kernel score will be strictly locally proper.
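The display above can be evaluated with standard-library functions only, using \( \Phi (s) = \frac{1}{2}(1 + \mathrm{erf}(s/\sqrt{2}))\); the function name is ours:

```python
import math

def gaussian_chaining(z, mu, sigma):
    """Componentwise chaining function from the display above:
    v_i(z) = (z_i - mu_i) * Phi(s_i) + sigma_i * phi(s_i), with
    s_i = (z_i - mu_i) / sigma_i, i.e. the antiderivative of the Gaussian
    cdf weight along each dimension."""
    out = []
    for zi, mi, si in zip(z, mu, sigma):
        s = (zi - mi) / si
        Phi = 0.5 * (1.0 + math.erf(s / math.sqrt(2.0)))   # standard normal cdf
        phi = math.exp(-0.5 * s * s) / math.sqrt(2.0 * math.pi)  # density
        out.append((zi - mi) * Phi + si * phi)
    return out
```

Each component is increasing and convex, behaves like \( z_{i} - \mu_{i}\) for large \( z_{i}\), and flattens towards zero far below \( \mu_{i}\), mirroring the univariate chaining function in Figure 1.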

These are only a few examples of possible weight and chaining functions that could be employed when evaluating forecasts and outcomes on \( \mathbb{R}^{d}\) while emphasizing high-impact events. Different choices of either function will generate a scoring rule that assesses different aspects of the forecast performance, and, in general, it is a task for a domain expert to choose the appropriate weight or chaining function in order to extract the relevant information for the problem at hand. In the remainder of this section, we examine how the weights and deformations presented above can be used within weighted multivariate scoring rules to assess forecasts made for high-impact events.

### 5.3. Simulation study.

#### 5.3.1. Outline.

In order to understand the properties of the various weighted multivariate scoring rules, we apply them to simulated forecasts and observations. The simulation study is organized as follows. The true distribution \(G\) and two forecast distributions \(F_{1}\) and \(F_{2}\) are given by

\begin{align*} \begin{split} F_{1}(z) = a(z)G(z) + (1 - a(z))H(z), \quad z \in \mathbb{R}^{d}, \\ F_{2}(z) = (1 - a(z))G(z) + a(z)H(z), \quad z \in \mathbb{R}^{d}, \end{split} \end{align*}

where \( G\) is a standard multivariate Gaussian distribution, and \( H\) is a multivariate Student’s \( t\) distribution with four degrees of freedom [20]. As in [19], the mixing function \( a\) is a univariate Gaussian distribution function with zero mean and standard deviation equal to one half, evaluated at \( \sum_{i=1}^{d} z_{i}\).

In order to assess the competing forecasts, 100 ensemble members are sampled at random from \( F_{1}\) and \( F_{2}\), and both forecasts are then evaluated against 100 observations drawn from \( G\), via ensemble forecast representations of the various weighted scoring rules. A Diebold–Mariano test [10] at the nominal level 0.05 is applied to assess whether or not one forecast outperforms the other when assessed using each score. This process is repeated 1000 times, and the proportion of instances in which the null hypothesis of equal predictive performance is rejected in favor of each forecast distribution is recorded. The rejection rates for the various scoring rules can then be examined to understand the discriminative behavior of the different scores. Such a framework has been considered in several previous studies on weighted scoring rules [11, 24, 19].
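The testing step above can be sketched as follows. This is a simplified Diebold–Mariano statistic that treats the score differentials as independent (the variance estimator in [10] can include an autocorrelation correction, omitted here), with function names of our choosing:

```python
import math

def diebold_mariano(scores_f1, scores_f2):
    """Two-sided Diebold-Mariano test of equal predictive performance.

    Takes per-instance scores for two forecasts, assumes independent score
    differentials, and returns (statistic, normal-approximation p-value).
    A significantly negative statistic favours F1 (lower scores are better).
    """
    d = [a - b for a, b in zip(scores_f1, scores_f2)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    stat = mean / math.sqrt(var / n)
    # p = 2 * (1 - Phi(|stat|)), with Phi evaluated via the error function
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(stat) / math.sqrt(2.0))))
    return stat, p
```

Rejections in favor of \( F_{1}\) at the nominal level 0.05 then correspond to \( \mathrm{stat} \lt 0\) and \( p \lt 0.05\), and similarly for \( F_{2}\) with \( \mathrm{stat} \gt 0\).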

In the univariate case, the forecasts are assessed using the CRPS and the IMS, while in the multivariate case, results are presented for the energy score, the variogram score, and the IMS. Threshold-weighted, outcome-weighted, and vertically rescaled versions of these kernel scores are all considered, as well as outcome-weighted scores that have been complemented with the Brier score in order to make them strictly locally proper, as discussed in subsection 3.2.

For concision, we consider only indicator-based weight functions, which are commonly applied in practice. Results are presented for the univariate weight function \( w(z) = \unicode{x1D7D9}\{ z \geq t \}\) and for two multivariate weight functions, \( w(z) = \unicode{x1D7D9}\{\sum_{i=1}^{d} z_{i} \geq t\}\) and \( w(z) = \unicode{x1D7D9}\{ z_{1} \geq t, \dots, z_{d} \geq t \}\). In all cases, we study the results as the threshold \( t\) is changed.

Equation (5.3) is used to construct a chaining function that generates localizing threshold-weighted kernel scores. When \( w(z) = \unicode{x1D7D9}\{ z_{1} \geq t, \dots, z_{d} \geq t \}\), the threshold-weighted scores are centered at \( x_{0} = (t, \dots, t)\). When \( w(z) = \unicode{x1D7D9}\{\sum_{i=1}^{d} z_{i} \geq t\}\), the scores are centered at \( x_{0} = (t/d, \dots, t/d)\). Where relevant, the vertically rescaled kernel scores are centered at \( (0, \dots, 0)\). If the kernel is only c.n.d., this is equivalent to using a threshold-weighted kernel score centered at the origin (Proposition 4.10). Hence, comparing the performance of the threshold-weighted and vertically rescaled scores allows us to assess how the scores depend on the point at which they are centered. For the threshold-weighted variogram score, however, the kernel is insensitive to whether the score is centered at \( (t, \dots, t)\) or \( (0, \dots, 0)\), and hence the results for the localizing threshold-weighted variogram score are equivalent to those for the vertically rescaled variogram score.

These scores are compared to threshold-weighted kernel scores that are nonlocalizing. In this case, the chaining function is chosen to resemble the chaining function in the univariate case. First, consider the weight function \( w(z) = \unicode{x1D7D9}\{z_{1} \geq t, \dots, z_{d} \geq t \}\). One possible chaining function in this case is \( v(z) = (\mathrm{max}(z_{1}, t), \dots, \mathrm{max}(z_{d}, t))\). This chaining function projects any point not contained in the orthant \(\{z_{1} \geq t, \dots, z_{d} \geq t \}\) onto its boundary while leaving the points in \( \{ w \gt 0 \}\) unchanged. Similarly, for the weight function \( w(z) = \unicode{x1D7D9}\{\sum_{i=1}^{d} z_{i} \geq t\}\), we consider the chaining function \( v(z) = (\mathrm{max}(z_{1}, z_{1} + l), \dots, \mathrm{max}(z_{d}, z_{d} + l))\), where \( l = (t - \sum_{i=1}^{d} z_{i})/d\).
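The two nonlocalizing chaining functions described in this paragraph can be written compactly; a sketch with hypothetical helper names:

```python
import numpy as np

def v_orthant(z, t):
    """Nonlocalizing chaining for w(z) = 1{z_1 >= t, ..., z_d >= t}:
    componentwise projection onto the boundary of the orthant."""
    return np.maximum(np.asarray(z, dtype=float), t)

def v_halfspace(z, t):
    """Nonlocalizing chaining for w(z) = 1{sum_i z_i >= t}: since the shift
    l = (t - sum_i z_i) / d is the same in every coordinate,
    (max(z_1, z_1 + l), ..., max(z_d, z_d + l)) equals z + max(l, 0)."""
    z = np.asarray(z, dtype=float)
    l = (t - z.sum()) / z.size
    return z + max(l, 0.0)
```

Points with \( \sum_{i} z_{i} \lt t\) are thus shifted equally in every coordinate until they land on the hyperplane \( \sum_{i} z_{i} = t\), while points in \( \{ w \gt 0 \}\) are left unchanged.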

#### 5.3.2. Results.

First, consider the univariate case. The forecast \( F_{1}\) is correctly specified in the upper tail but exhibits a heavy lower tail, whereas the opposite is true for \( F_{2}\). As such, since the weight function focuses on the upper tail of the forecast distributions, for large values of \( t\) the weighted scoring rules should reject the null hypothesis of equal predictive performance in favor of \( F_{1}\).

While the rejection rate of the unweighted CRPS and IMS is close to 0.025 for all thresholds, Figure 3 illustrates that the frequency of rejections in favor of \( F_{1}\) generally increases with the threshold when a weighted scoring rule is used to assess the forecasts. The outcome-weighted scores perform poorly for larger thresholds, since they are sensitive to the number of observations that exceed the threshold, which is a result also observed in [19]. Complementing the owCRPS and owIMS with the Brier score generates scoring rules that can better distinguish between the two forecasts.

The threshold-weighted scoring rules also perform well. These emphasize extreme events by mapping values below the threshold(s) of interest to the center \(x_{0}\). This makes the scores less informative when assessing overall forecast performance but means that errors in the tail of the forecast distributions make a larger contribution to the scores. Since the differences between the two forecast distributions are constructed to be in the tails, the rejection rate of the Diebold–Mariano tests for equal predictive performance increases for larger thresholds. Comparing the threshold-weighted CRPS to the vertically rescaled CRPS illustrates the sensitivity of the scores to the point at which the two weighted scores are centered; in this case, centering the scores at the threshold appears more beneficial than centering them at the origin.

The right-hand side of Figure 3 shows the proportion of rejections in favor of \( F_{2}\). As would be expected, and in contrast to \( F_{1}\), this rejection frequency is close to zero for large thresholds, and it increases towards 0.025 as the threshold becomes smaller, mirroring the results presented in [19]. The owIMS complemented with the Brier score, on the other hand, results in a large rejection frequency when the threshold is low.

Consider now results for the bivariate setting. In this case, \( F_{1}\) is close to the true data generating process \( G\) in the upper right quadrant (when the mixing function is close to one), whereas \( F_{2}\) is similar to the true distribution in the lower left quadrant, meaning the weighted scores should again reject the hypothesis of equal predictive performance in favor of \( F_{1}\) when the threshold \( t\) is large.

Figure 4 displays the rejection frequency in favor of \( F_{1}\) and \( F_{2}\) corresponding to each scoring rule for the weight function \( w(z) = \unicode{x1D7D9}\{z_{1} \geq t, z_{2} \geq t\}\). Results similar to those presented in the univariate case are observed. In particular, neither the energy score, the variogram score, nor the inverse multiquadric score can distinguish between the two forecasts, whereas their weighted versions can, particularly when interest is focused on relatively large thresholds. The rejection rates corresponding to the outcome-weighted scores increase slightly with the threshold but then tend towards zero as the number of observations that exceed the threshold decreases. Complementing these scores with the Brier score again generates scoring rules that are capable of identifying the differences between the forecasts. The threshold-weighted scores are also adept at capturing the differences in forecast behavior as the threshold is increased. This is true for both the localizing and nonlocalizing variants, though the localizing score is generally more discriminative for large thresholds. For the energy score, the vertically rescaled score is again slightly less informative than the localizing threshold-weighted score, suggesting it is preferable to center these scores close to the threshold of interest.

Figure 5 displays the corresponding results for the weight function \( w(z) = \unicode{x1D7D9}\{z_{1} + z_{2} \geq t\}\). The weighted scoring rules in this case appear to be less discriminative than in the previous setting, though the conclusions are largely similar. The unweighted multivariate scores cannot distinguish between the two forecasts, whereas the weighted scores do so successfully. The exception to this is the nonlocalizing threshold-weighted scores, which are only marginally more discriminative than the unweighted scores, regardless of the threshold used within the weight function. The reason for this is that the chaining function in this case projects points in \( \{ w=0\}\) onto the line \( z_{1} + z_{2} = t\), which retains a lot of information about these points. The resulting score is thus dominated by differences between points in \( \{ w=0\}\).

These examples all consider the case where the forecast distribution \(F_{2}\) resembles a Student’s \(t\) distribution in the upper tail, whereas the observations are drawn from a standard normal distribution. That is, \(F_{2}\) is misspecified by having a tail that is too heavy. However, [24] remarks that, in practice, if the threshold in the weight function \(w(z) = \unicode{x1D7D9}\{z \geq t\}\) is so large that it exceeds the maximum of the observations, then the weighted scores will not depend on the observations. Instead, the scores will be determined by the probability that the forecast distributions assign to \(\{ w \gt 0\}\), with the lighter-tailed distribution receiving the lower score. We anticipate that this will be yet more prevalent in the multivariate case, since the number of observations that exceed the threshold in multiple dimensions will generally be even smaller.

To demonstrate this, suppose that the true distribution \(G\) is now a Student’s \(t\) distribution with four degrees of freedom, and \(F_{1}\) and \(F_{2}\) are as defined above. In this case, \(F_{1}\) is again correctly specified in the tail, whereas \(F_{2}\) has a tail that is too light. The rejection frequencies of the corresponding Diebold–Mariano tests are displayed for the univariate and bivariate cases in Figure 6. For concision, results are only presented for the CRPS in the univariate case, and the energy score in the multivariate case, though analogous results are observed for the other scoring rules. Similarly, only the multivariate weight function \( w(z) = \unicode{x1D7D9}\{z_{1} \geq t, \dots, z_{d} \geq t \}\) is considered.

Although the weighted scoring rules should prefer \(F_{1}\) to \(F_{2}\), the null hypothesis is more commonly rejected in favor of \(F_{2}\) for large thresholds. As discussed, this issue is stronger in the multivariate case, since even fewer observations exceed the threshold in both dimensions, meaning the weighted scores become dominated by the tail probabilities for more moderate thresholds. This seems to be less of an issue for the nonlocalizing threshold-weighted energy score, since it retains more information from the forecasts and observations below the threshold, rather than projecting them onto a single point (as in the localizing variant).

This simulation study therefore illustrates that the weighted scoring rules are capable of distinguishing between forecast distributions that differ in particular regions of the outcome space, both in the univariate and multivariate settings. Similar results are also observed for dimensions higher than \(d=2\), though these have been omitted for concision. The differences between the forecast distributions and the observation distribution become larger in higher dimensions, leading to a higher rejection rate in the multivariate case, despite the curse of dimensionality. However, when the events of interest are so rare that they are not present in the observed sample (due to either a small sample size or a very large threshold), then the weighted scoring rules will depend solely on the tail probabilities, with the lighter-tailed forecast receiving the lower score [24]. That is, in order to effectively compare forecasts with respect to high-impact events, the number of observations should be large enough that these high-impact events are observed within the sample. This requirement is intuitive but may not be satisfied in many practical applications, particularly when forecasting rare events in very high dimensions.

## 6. Case study.

### 6.1. Outline.

In this section, the weighted energy, variogram, and inverse multiquadric scores described previously are used to evaluate daily rainfall accumulation forecasts, with emphasis on events that could lead to flooding. Flooding and other associated impacts could result, for example, from a large precipitation accumulation on a single day, or from moderate rainfall over consecutive days. Changing the weight used within the weighted multivariate scoring rules allows us to consider these different possibilities when evaluating forecasts. Whether or not an impact occurs will depend not only on the amount of rainfall, but also on other factors, such as the temperature or a location’s capacity to deal with large rainfall accumulations. These external factors are not considered in this study, though the weight or chaining function within the weighted scoring rules could be adjusted to include this information. For example, a weight function could be used that employs different parameters depending on certain covariates or location-specific characteristics.

The daily precipitation accumulation forecasts considered here were issued by the COSMO-E ensemble prediction system of the Swiss Federal Office of Meteorology and Climatology (MeteoSwiss). The forecasts and corresponding observations are available at 245 weather stations across Switzerland, which are displayed in Figure 7, along with the average observed daily accumulation at each station over the period of interest, comprising four autumn seasons (September to November) during 2016–2019. This results in roughly 100,000 pairs of forecasts and observations. The forecasts are initialized at 00 UTC, and their performance is analyzed over the three consecutive days following this initialization time.

It is common for these dynamical weather forecasts to undergo some form of statistical postprocessing. Since the physical mechanisms underlying flooding events may occur on timescales longer than one day, the COSMO-E forecasts are postprocessed at individual lead times and then combined using copula approaches to generate multivariate forecasts for the precipitation accumulation for the following three days. Four different copulas are considered.

An independence copula assumes that there is no dependence between the precipitation accumulation on successive days (conditional on the ensemble output), whereas a comonotonic copula conversely assumes perfect dependence, so that heavy rainfall on one day is followed by heavy rainfall on the next day. The third approach, ensemble copula coupling (ECC), utilizes an empirical copula based on the dependence structure observed in the COSMO-E ensemble forecast; it thus assumes that the numerical weather model correctly simulates the dependence structure observed in reality. ECC is well-established in the postprocessing literature and is commonly implemented in operational postprocessing suites [34]. Finally, we employ a Gaussian copula, which was found here to outperform alternative parametric copula families. Further details of the statistical postprocessing framework are presented in the supplementary material.

The postprocessing model considerably improves the calibration of the raw COSMO-E forecasts, both in the univariate and multivariate settings (see the supplementary material). In the remainder of this section, the marginal forecast performance at each lead time is evaluated using the CRPS and the univariate IMS, while the multivariate forecast distributions (over the three-day period) are assessed using the energy score, the variogram score, and the IMS. Since interest is predominantly on forecasts made for events that could lead to flooding, threshold-weighted and vertically rescaled versions of these kernel scores are also employed, both in a univariate context for the individual daily accumulations, and in a multivariate context for the combined daily accumulations over three days.

### 6.2. Results.

The four multivariate postprocessed forecasts differ only in their choice of copula, and hence all exhibit the same univariate forecast performance. Table 1 displays the CRPS and IMS for these postprocessed forecasts, as well as for the COSMO-E model output. Both scores suggest that statistical postprocessing is beneficial at all lead times, though the improvement decreases as lead time increases. Table 1 also presents the threshold-weighted and vertically rescaled versions of these two univariate scores, where the weight is an indicator function that emphasizes daily precipitation accumulations that exceed 50 mm, roughly corresponding to the 99th percentile of the observed accumulations in the training data set. The corresponding chaining function used within the threshold-weighted scores is \( v(z) = \mathrm{max}(z, 50)\), while the vertically rescaled scores, where relevant, are centered at the origin. As with the unweighted scores, postprocessing improves upon the raw ensemble forecast, regardless of the weighted score used to assess the forecasts.

Scores for the multivariate forecasts are presented in Table 2. The comonotonic copula approach performs worst with respect to all unweighted scores, whereas the COSMO-E ensemble outperforms the postprocessed forecasts when evaluated using the variogram score. Since the variogram score is more sensitive to the forecast dependence structure than the energy score, this result suggests that any improvements gained by postprocessing are principally due to an improved marginal performance. The independence copula, ECC, and Gaussian copula approaches all perform similarly, suggesting that the COSMO-E output already captures the majority of the dependence between the precipitation accumulations on successive days.

Results are also presented in Table 2 for evaluating forecasts using threshold-weighted and vertically rescaled versions of these multivariate scores. First, consider a weight function that is equal to one when the daily precipitation exceeds 25 mm on the three consecutive days—this is labeled a “successive exceedance.” As in the simulation study, two different chaining functions are used within the threshold-weighted energy and variogram scores, one of which results in a localizing weighted score, while the other does not.

The conclusions drawn from the two chaining functions are not the same. The raw ensemble output and the comonotonic copula approach perform worst according to the nonlocalizing threshold-weighted scores, regardless of whether the energy score, variogram score, or IMS is considered. However, when evaluated using a localizing threshold-weighted score, or a vertically rescaled score, these two approaches typically outperform the other postprocessing methods. This is particularly the case for the energy score, even though the raw ensemble and the comonotonic approaches perform worst with respect to the unweighted energy score.
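The contrast between localizing and nonlocalizing chaining functions can be made concrete with a small sketch (an illustration of ours with toy values, not the paper's implementation). For the successive-exceedance weight, a nonlocalizing choice applies \(v(z) = \max(z, 25)\) componentwise, while a localizing choice maps every zero-weight outcome to a fixed center \(x_0\), in the spirit of Proposition 4.10:

```python
import numpy as np

def tw_energy_score(ens, obs, v):
    """Ensemble estimate of the threshold-weighted energy score."""
    vx = np.array([v(m) for m in np.asarray(ens, dtype=float)])
    vy = v(np.asarray(obs, dtype=float))
    term1 = np.mean(np.linalg.norm(vx - vy, axis=1))
    term2 = 0.5 * np.mean(np.linalg.norm(vx[:, None, :] - vx[None, :, :], axis=2))
    return term1 - term2

t, x0 = 25.0, np.zeros(3)
w = lambda x: float(np.all(x > t))             # successive-exceedance weight
v_nonloc = lambda x: np.maximum(x, t)          # nonlocalizing chaining
v_loc = lambda x: x * w(x) + x0 * (1 - w(x))   # localizing chaining

obs = np.array([40.0, 40.0, 40.0])
ens_a = [np.array([30.0, 30.0, 30.0]), np.array([5.0, 5.0, 5.0])]
ens_b = [np.array([30.0, 30.0, 30.0]), np.array([30.0, 30.0, 5.0])]

# The two ensembles differ only on outcomes with zero weight, so the
# localizing score cannot distinguish them; the nonlocalizing score can.
s_a = tw_energy_score(ens_a, obs, v_loc)
s_b = tw_energy_score(ens_b, obs, v_loc)
```

This illustrates why the two variants can rank forecasts differently: the nonlocalizing score still rewards accuracy on outcomes outside the region of interest, whereas the localizing score discards that information entirely.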

This highlights that although one forecast strategy might result in the best overall forecast performance, this may not be the preferred approach when interest is focused on high-impact events. To reinforce this, Figure 8 displays the energy score against the (localizing) threshold-weighted energy score for the five forecasting approaches. By analyzing forecast performance with respect to multiple objectives, we can clearly see the trade-offs between the different approaches: the independence copula, ECC, and Gaussian copula generate forecasts that outperform the COSMO-E ensemble when evaluated using the energy score, but this comes at the expense of forecast accuracy when interest is focused on high-impact events.

Consider now employing a weight function in these weighted scores that is equal to one if the total precipitation over all three days exceeds 75 mm, and is equal to zero otherwise—this is labeled “total exceedance” in Table 2. In this case, the weight function acknowledges that flooding could result from moderate precipitation on consecutive days, or extreme precipitation on just a single day. Since the simulation study in the previous section suggested that the nonlocalizing variant of the threshold-weighted scores was not effective for this weight function, only the localizing versions have been applied. The independence copula, ECC, and Gaussian copula approaches generate the most accurate forecasts for these events, while the comonotonic copula and, in particular, the raw ensemble forecast perform comparatively poorly. This weight function depends less on the multivariate dependence structure than the previous weight function, and hence the results are more similar to those for the case when an unweighted scoring rule is used to evaluate forecast performance.

## 7. Discussion.

By evaluating the quality of their predictions, forecasters gain essential information regarding how their forecasts perform and how they can be improved. In practice, there are typically certain outcomes that have a higher impact on the forecast users than others, and it is therefore particularly beneficial to issue accurate predictions for these high-impact events. However, methods for evaluating forecasts of high-impact events are scarce. While weighted scoring rules have become the standard approach to this, they have been studied almost exclusively in the univariate setting, ignoring the fact that an impact may arise from the interaction of several variables. To address this, we have introduced weighted multivariate scoring rules, which can emphasize particular multivariate outcomes of interest during forecast evaluation.

The weighted multivariate scoring rules developed here have been constructed by exploiting existing theory on conditionally negative definite kernels and the associated kernel score framework. It is shown that the well-known threshold-weighted continuous ranked probability score is a kernel score. This allows us to construct generalizations of the twCRPS that can be applied in a range of practical applications. While we demonstrate this by introducing weighted multivariate scoring rules, these generalizations could readily be applied in other settings, for example, when evaluating probabilistic forecasts of structured objects, including functions, graphs, and trees.

We explore these weighted scoring rules in the context of multivariate forecast evaluation and compare them to alternative weighted scores previously proposed by [19]. The two most popular scoring rules for assessing multivariate forecasts are the energy score and variogram score, both of which fall into the kernel score framework; additionally, we consider a new kernel score for both univariate and multivariate outcomes based on a bounded kernel. This inverse multiquadric score appears competitive with respect to discriminative ability and has the advantage of being strictly proper with respect to all probability measures on \(\mathbb{R}^d\). We introduce various weighted energy scores, variogram scores, and inverse multiquadric scores that can emphasize particular outcomes when evaluating multivariate forecasts, and use simulated data to demonstrate the utility of these weighted scores when distinguishing between forecasts made for high-impact events.
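The inverse multiquadric score can be estimated from an ensemble in the same way as any kernel score. In the sketch below, the kernel \( \rho(x, x^{\prime}) = 1 - (1 + \|x - x^{\prime}\|^{2})^{-1/2}\) is our assumption about the intended form, chosen so that \(-\rho\) equals the classical bounded inverse multiquadric kernel up to an additive constant; the paper's exact definition may differ.

```python
import numpy as np

def kernel_score(ens, obs, rho):
    """Ensemble estimate of the kernel score
    S_rho(P, y) = E[rho(X, y)] - 0.5 * E[rho(X, X')]."""
    x = np.asarray(ens, dtype=float)
    y = np.asarray(obs, dtype=float)
    term1 = np.mean([rho(xi, y) for xi in x])
    term2 = 0.5 * np.mean([[rho(xi, xj) for xj in x] for xi in x])
    return term1 - term2

def imq_rho(x, xp):
    # Bounded c.n.d. kernel (assumed form); values lie in [0, 1)
    return 1.0 - 1.0 / np.sqrt(1.0 + np.sum((x - xp) ** 2))

rng = np.random.default_rng(0)
y = np.array([1.0, 2.0])
s = kernel_score(rng.normal(size=(50, 2)), y, imq_rho)
```

Because \( \rho\) is bounded, no moment conditions on the forecast distribution are required, consistent with the score being proper with respect to all probability measures on \(\mathbb{R}^d\).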

These weighted scoring rules are then applied to daily precipitation accumulation forecasts, with a particular focus on events that could lead to flooding. Multivariate statistical postprocessing methods are compared when forecasting these events, using both weighted and unweighted scores. Importantly, the conclusions drawn from the weighted scoring rules do not always coincide with those drawn from the unweighted scores, meaning the strategy that generates the most accurate forecasts is not necessarily optimal when the goal is to predict flooding events. The weighted scoring rules can discriminate well between forecast distributions that exhibit contrasting behavior in particular regions of the outcome space, something that unweighted scoring rules are not capable of. Code to reproduce these results can be found at https://github.com/sallen12/weighted_mv_scores, and many of the weighted scoring rules discussed herein are available in the widely used scoringRules package in R [22].

While this paper discusses scoring rules in the context of forecast verification, weighted scoring rules are also relevant for model validation and model selection. For example, several studies have designed probabilistic models to quantify the uncertainty inherent in complex physical processes, and proper scoring rules provide a convenient framework with which to compare competing models. Weighted scoring rules allow this to be achieved while emphasizing particular outcomes. For example, [3] employs a Gaussian process emulator to model the maximum height of tsunami waves, and weighted scoring rules could be used to assess the emulator’s accuracy when modeling large tsunami waves, which can result in severe societal impacts. Similarly, [7] uses a probabilistic model to emulate ice sheet coverage in Antarctica. In this case, weighted scoring rules would allow competing models to be compared when predicting sizeable changes in ice volume, while multivariate weighted scoring rules could additionally assess how well the models can predict simultaneously high rates of ice sheet melting in multiple spatial regions, for example.

These probabilistic models could also be trained using weighted scoring rules. The general approach of optimum score estimation [16] employs proper scoring rules as loss functions when estimating model parameters. If the model is correctly specified, then any strictly proper scoring rule should return the parameters corresponding to the data generating process. However, if this is not the case, then employing a weighted scoring rule as a loss function should result in a model that more accurately predicts the outcomes emphasized by the weight or chaining function, allowing particular outcomes to be targeted during model training. For weighted scoring rule estimation to be feasible in practice, the weighted scores should be available in closed form. Closed form representations are difficult to obtain for arbitrary scoring rules, weight functions, and forecast distributions, a difficulty that is even more pronounced in the multivariate case, where even unweighted scoring rules are rarely available in closed form. In the univariate case, however, closed form expressions can often be derived when familiar distributions and simple weight functions are employed; for example, an analytical formula for the twCRPS is available in [2] when the weight function is an indicator function and the forecast is a logistic distribution. That being said, weighted scoring rules based on indicator weight functions are generally not strictly proper.
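To sketch how a weighted score might serve as a loss function (a toy illustration under our own assumptions, using simulation-based rather than closed-form scores), the location parameter of a Gaussian model can be estimated by minimizing the average twCRPS of simulated ensembles over a parameter grid:

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_tw_crps(ens, obs, v):
    """Average twCRPS of a single ensemble over an array of observations."""
    vx = v(np.asarray(ens, dtype=float))
    vy = v(np.asarray(obs, dtype=float))
    term1 = np.mean(np.abs(vx[None, :] - vy[:, None]), axis=1)  # one value per obs
    term2 = 0.5 * np.mean(np.abs(vx[:, None] - vx[None, :]))    # ensemble spread term
    return float(np.mean(term1 - term2))

obs = rng.normal(loc=2.0, scale=1.0, size=200)  # data from the "true" process N(2, 1)
v = lambda z: np.maximum(z, 1.0)                # chaining emphasizing outcomes above 1

# Optimum score estimation via grid search over the model location mu
grid = np.linspace(0.0, 4.0, 41)
scores = [mean_tw_crps(rng.normal(loc=mu, scale=1.0, size=500), obs, v)
          for mu in grid]
mu_hat = float(grid[np.argmin(scores)])
```

Since the model here is correctly specified, the minimizer lies near the true location; under misspecification, the chaining function would instead pull the fit toward the emphasized region of the outcome space.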

The weighted scoring rules that we introduce are constructed using conditionally negative definite kernels. Choosing a kernel that is relevant for a particular problem provides an extremely flexible approach for evaluating forecast performance, as illustrated by the fact that the variogram score is itself a kernel score. To this end, we believe that kernel scores have been underappreciated in the field of forecast verification. Kernels are used widely in several areas of mathematics and machine learning, and kernel scores could be employed to evaluate predictions made in a wide range of circumstances, including those for which established forecast verification tools do not currently exist.

In this paper, kernel scores are used alongside fairly simple weight and chaining functions that allow practitioners to target high-impact events. Future applications of these scores could consider more elaborate weights that are especially suited to particular applications. It is difficult to provide general guidelines regarding which weight function, chaining function, and kernel score should be chosen for a given problem. As has been discussed, these choices will depend strongly on the application at hand and should generally be made by a domain expert. In [1], for example, probabilistic forecasts for heatwaves are evaluated using weighted multivariate scoring rules, with the weight function determined by an operational definition of a heatwave. Further discussion on the differences between the weighted scoring rules and their practical advantages is also available in [1].

While the simulation study presented in section 5 sheds some light on the strengths and limitations of the various weighted scoring rules proposed herein, an interesting avenue for future research is to investigate the discriminative ability of the competing approaches, and to formally assess their relevance in different applications. In the multivariate case, it is also not clear how much the weighted scoring rules will be influenced by changes to the multivariate dependence structure, compared to changes in the marginal distributions. This is an open question for unweighted multivariate scoring rules, and it is not obvious how the results will differ for weighted variants. Nonetheless, we expect that a more comprehensive analysis of forecast performance for multivariate outcomes will also inspire new developments in forecasting methodology, with the goal of obtaining more accurate forecasts for high-impact events without deteriorating overall forecast accuracy.

## Appendix A. Proof of Theorem 2.2.

If \(\rho\) is measurable and negative definite, then the result is a restatement of [38, Theorem 1.1]. If \(\rho\) is a measurable c.n.d. kernel on \(\mathcal{X}\) and \(x_{0} \in \mathcal{X}\), then by [4, Lemma 2.1], the kernel

\begin{align} k(x, x^{\prime }) = \rho (x, x^{\prime }) + \rho (x_{0}, x_{0}) - \rho (x, x_{0}) - \rho (x^{\prime }, x_{0}) \end{align}

(A.1)

is negative definite, and also measurable, since \( \rho\) is measurable. Assuming all the relevant integrals exist and are finite, it is a straightforward calculation to show that \(S_{k}(P, y) = S_{\rho }(P, y)\). By [38, Theorem 1.1], \(S_{k}\) is a proper scoring rule with respect to the class of all probability measures \( P \in \mathfrak{M}\) such that \( \mathbb{E}_{P}[ \sqrt{-k(X, X)}] \lt \infty\). Note that \( -k(x, x)\) is nonnegative since \( - k\) is positive definite. In particular, \( S_{k}\) is a proper scoring rule with respect to the smaller class of all probability measures \( P \in \mathfrak{M}\) such that \( \mathbb{E}_{P}[-k(X, X)] \lt \infty\).

We will now show that when \( \rho (x, x) = 0\) for all \( x \in \mathcal{X}\), this class of measures coincides with \( \mathcal{M}_{\rho }\). We have that

\begin{align*} k(x, x) = \rho (x, x) + \rho (x_{0}, x_{0}) - 2\rho (x, x_{0}), \end{align*}

and hence \( \rho (x, x) = 0\) for all \( x \in \mathcal{X}\) implies that \( \rho\) is nonnegative and thus a semimetric. In this case, [37, Proposition 20] states that \( P \in \mathcal{M}_{\rho }\) if and only if \( \mathbb{E}_{P}[-k(X, X)] \lt \infty\). Although this proposition is stated for the specific situation where \( \mathcal{A}\) is the Borel \( \sigma\)-algebra with respect to a topology on \( \mathcal{X}\), this detail is not needed for the arguments in the proof, so the proposition holds for any measurable space \( (\mathcal{X}, \mathcal{A})\).

It follows that all integrals in (2.2) are finite for \( P \in \mathcal{M}_{\rho }\). This is the case since, first, \( \mathbb{E}_{P}[\rho (X, x_{0})] \lt \infty\) for some \( x_{0} \in \mathcal{X}\) is equivalent to \( \mathbb{E}_{P}[\rho (X, x_{0})] \lt \infty\) for *all* \( x_{0} \in \mathcal{X}\); second, (A.1) holds; and finally, \( -k(x, x^{\prime }) \leq \sqrt{-k(x, x)}\sqrt{-k(x^{\prime }, x^{\prime })}\).

## Appendix B. Proof of Proposition 3.1.

Let \( \nu\) be a locally finite Borel measure on \( \mathbb{R}\), and let \( v: \mathbb{R} \to \mathbb{R}\) be a function such that \( v(x) - v(x^{\prime }) = \nu ([x^{\prime }, x))\) for any points \( x, x^{\prime } \in \mathbb{R}\). Then,

\begin{align*} \rho (x, x^{\prime }) = \vert v(x) - v(x^{\prime }) \vert = \int_{\mathbb {R}} \left (\unicode {x1D7D9}\{x^{\prime }\leq u \lt x\} + \unicode {x1D7D9}\{x\leq u \lt x^{\prime }\}\right ) \: \mathrm {d} \nu (u). \end{align*}

If \( X, X^{\prime }\) are independent random variables distributed according to \(P, Q \in \mathcal{M}_{\rho },\) respectively, with associated distribution functions \( F\) and \( G\), then by the Fubini–Tonelli theorem,

\begin{align*} \begin{split} \mathbb{E}_{F,G} \vert v(X) - v(X^{\prime }) \vert & = \int_{\mathbb{R}} \left ( \mathbb{E}_{F,G}\left [ \unicode{x1D7D9}\{X^{\prime } \leq u \lt X\} \right ] + \mathbb{E}_{F,G}\left [ \unicode{x1D7D9}\{X \leq u \lt X^{\prime } \}\right ] \right ) \: \mathrm{d} \nu (u) \\ & = \int_{\mathbb{R}} G(u) \left [ 1 - F(u) \right ] + F(u) \left [ 1 - G(u) \right ] \: \mathrm{d} \nu (u). \\ \end{split} \end{align*}

From this, it is straightforward to verify that if \( X, X^{\prime } \sim F\) are independent, then

\begin{align*} \mathbb {E}_{F} \vert v(X) - v(y) \vert - \frac {1}{2} \mathbb {E}_{F} \vert v(X) - v(X^{\prime }) \vert = \int_{\mathbb {R}} \left (F(u) - \unicode {x1D7D9}\{y \leq u\} \right )^2 \: \mathrm {d} \nu (u) = \textrm {twCRPS}(F, y; \nu ). \end{align*}

Since \( \rho (x, x^{\prime })\) is a c.n.d. kernel, this proves that the threshold-weighted CRPS is a kernel score.

To complete the proof of Proposition 3.1, it remains to show that this representation of the threshold-weighted CRPS is equivalent to a quantile scoring rule integrated over all \( \alpha \in (0, 1)\). The general form of a quantile scoring rule is

\begin{align*} QS_{v, \alpha }(F, y) = (\unicode {x1D7D9}\{F^{-1}(\alpha ) \geq y\} - \alpha )(v(F^{-1}(\alpha )) - v(y)) \end{align*}

for some increasing function \( v\), where \( F^{-1}\) is the generalized inverse of \( F\). Integrating this with respect to \( \alpha\) gives

\begin{align*} \begin{split} &2 \int_{(0, 1)} (\unicode{x1D7D9}\{F^{-1}(\alpha ) \geq y\} - \alpha )(v(F^{-1}(\alpha )) - v(y)) \: \mathrm{d} \alpha \\ & \!\quad = \int_{(0, 1)} (2\unicode{x1D7D9}\{F^{-1}(\alpha ) \geq y\} - 1)(v(F^{-1}(\alpha )) - v(y)) \: \mathrm{d} \alpha - \int_{(0, 1)} (2\alpha - 1)(v(F^{-1}(\alpha )) - v(y)) \: \mathrm{d} \alpha \\ & \!\quad = \int_{(0, 1)} |v(F^{-1}(\alpha )) - v(y)| \: \mathrm{d} \alpha - \int_{\mathbb{R}} (2F(x) - 1)(v(x) - v(y)) \: \mathrm{d} F(x), \\ & \!\quad = \int_{\mathbb{R}} |v(x) - v(y)| \: \mathrm{d} F(x) - \int_{\mathbb{R}} \int_{\mathbb{R}} (2\unicode{x1D7D9}\{x^{\prime } \leq x\} - 1)(v(x) - v(y)) \: \mathrm{d} F(x^{\prime })\: \mathrm{d} F(x). \end{split} \end{align*}

The first term is equivalent to \( \mathbb{E}_{F}|v(X) - v(y)|\) with \( X \sim F\), while the latter can be rewritten as

\begin{align*} \begin{split} I_{2} & = \mathbb{E}_{F}\left [ (2 \unicode{x1D7D9}\{X^{\prime } \leq X \} - 1)(v(X) - v(y)) \right ] \\ & = \mathbb{E}_{F}\left [ (2 \unicode{x1D7D9}\{X^{\prime } \leq X \} - 1)(v(X) - v(X^{\prime })) \right ] + \mathbb{E}_{F}\left [ (2 \unicode{x1D7D9}\{X^{\prime } \leq X \} - 1)(v(X^{\prime }) - v(y)) \right ] \\ & = \mathbb{E}_{F} |v(X) - v(X^{\prime })| - \mathbb{E}_{F} \left [ (2 \unicode{x1D7D9}\{X \lt X^{\prime }\} - 1)(v(X^{\prime }) - v(y)) \right ] \\ & = \mathbb{E}_{F} |v(X) - v(X^{\prime })| - I_{2}, \end{split} \end{align*}

where \( X,X^{\prime } \sim F\) are independent. Rearranging gives \( I_{2} = \mathbb{E}_{F} |v(X) - v(X^{\prime })|/2\), which in turn yields

\begin{align*} 2 \int_{(0, 1)} (\unicode {x1D7D9}\{F^{-1}(\alpha ) \geq y\} - \alpha )(v(F^{-1}(\alpha )) - v(y)) \: \mathrm {d} \alpha = \mathbb {E}_{F}|v(X) - v(y)| - \frac {1}{2} \mathbb {E}_{F} |v(X) - v(X^{\prime })|, \end{align*}

as desired.

## Appendix C. Proof of Proposition 3.2.

Let \( \mathcal{M}\) denote the set of Borel probability measures on \( \mathcal{X} = \mathbb{R}\) with finite first moment. Let \( F \in \mathcal{M}\) such that \( \mathbb{E}_{F}[w(X)] \gt 0\), and define \( F_{w}\) as in (3.4). Since \( F \in \mathcal{M}\), we also have that \( F_{w} \in \mathcal{M}\). Then, the outcome-weighted CRPS with weight function \( w\) can be written as

\begin{align*} \begin{split} \mathrm{owCRPS}(F, y; w) =\,\, & w(y)\mathrm{CRPS}(F_{w}, y) = \mathbb{E}_{F_{w}} |Z - y| w(y) - \frac{1}{2} \mathbb{E}_{F_{w}} |Z - Z^{\prime }| w(y) \\ =\,\, & \frac{1}{C_{w}(F)} \mathbb{E}_{F} \left [ |X - y|w(X)w(y) \right ] - \frac{1}{2C_{w}(F)^{2}} \mathbb{E}_{F} \left [ |X - X^{\prime }|w(X)w(X^{\prime })w(y) \right ], \end{split} \end{align*}

where \( Z, Z^{\prime } \sim F_{w}\) and \( X, X^{\prime } \sim F\) are independent.

## Appendix D. Proof of Proposition 4.3.

Let \( \rho\) be a c.n.d. kernel on \( \mathcal{X}\), let \( v: \mathcal{X} \to \mathcal{X}\) be a measurable function, and let \( \tilde{\rho }(x, x^{\prime }) = \rho (v(x), v(x^{\prime }))\). Suppose that \( S_{\rho }\) is a strictly proper scoring rule with respect to \( \mathcal{M}_{\rho }\).

First, assume that \( v\) is injective. Let \( \tilde{P}, \tilde{Q} \in \mathcal{M}_{\tilde{\rho }}\). We wish to show that \( d_{\tilde{\rho }}(\tilde{P}, \tilde{Q}) = 0\) implies \( \tilde{P} = \tilde{Q}\). To do so, define \( P\) and \( Q\) as the push-forward measures of \( \tilde{P}\) and \( \tilde{Q}\) under \( v\), respectively, i.e., \( P(A) = \tilde{P}(v^{-1}(A))\) and \( Q(A) = \tilde{Q}(v^{-1}(A))\) for all \( A \in \mathcal{A}\). Note that, since \( v\) is injective, it is also bimeasurable. Additionally, for \( \tilde{P} \in \mathcal{M}_{\tilde{\rho }}\), we have \( \mathbb{E}_{\tilde{P}}[\rho (v(X), v(x_{0}))] \lt \infty\) for some \( x_{0} \in \mathcal{X}\). Letting \( x_{1} = v(x_{0})\),

\begin{align*} \mathbb {E}_{\tilde {P}}[\rho (v(X), v(x_{0}))] = \mathbb {E}_{P}[\rho (X, x_{1})] \lt \infty, \end{align*}

and hence \( P, Q \in \mathcal{M}_{\rho }\).

The divergence function of the kernel score associated with \( \tilde{\rho }\) can then be written as

\begin{align*} \begin{split} d_{\tilde{\rho }}(\tilde{P}, \tilde{Q}) & = \mathbb{E}_{P, Q} \left [ \rho (X, Y) \right ] - \frac{1}{2} \mathbb{E}_{P} \left [ \rho (X, X^{\prime }) \right ] - \frac{1}{2} \mathbb{E}_{Q} \left [ \rho (Y, Y^{\prime }) \right ] = d_{\rho }(P, Q), \end{split} \end{align*}

where \( X, X^{\prime } \sim P\) and \( Y, Y^{\prime } \sim Q\) are independent. Since \( S_{\rho }\) is strictly proper with respect to \( \mathcal{M}_{\rho }\), \( d_{\rho }(P, Q) = 0\) implies \( P = Q\). Then, for any \( A \in \mathcal{A}\), we have that

\begin{align*} \tilde {P}(A) = \tilde {P}(v^{-1}(v(A))) = P(B) = Q(B) = \tilde {Q}(v^{-1}(v(A))) = \tilde {Q}(A), \end{align*}

where \( B = v(A) \in \mathcal{A}\), and \( v^{-1}(v(A)) = A\) holds for all \( A \in \mathcal{A}\) since \( v\) is injective. Hence, \( d_{\tilde{\rho }}(\tilde{P}, \tilde{Q}) = 0\) implies that \( \tilde{P} = \tilde{Q}\).

Conversely, assume that \(\mathrm{tw}S_{\rho }\) is strictly proper with respect to \( \mathcal{M}_{\tilde{\rho }}\). If \( v\) is not injective, then there exist distinct \( z, z^{\prime } \in \mathcal{X}\) such that \( v(z) = v(z^{\prime })\). Letting \( \tilde{P} = \delta_{z}\) and \( \tilde{Q} = \delta_{z^{\prime }}\) be Dirac measures at \( z\) and \( z^{\prime }\), respectively, we have

\begin{align*} \begin{split} d_{\tilde{\rho }}(\tilde{P}, \tilde{Q}) & = \rho (v(z), v(z^{\prime })) - \frac{1}{2} \rho (v(z), v(z)) - \frac{1}{2} \rho (v(z^{\prime }), v(z^{\prime })), \\ & = \rho (v(z), v(z)) - \frac{1}{2} \rho (v(z), v(z)) - \frac{1}{2} \rho (v(z), v(z)) = 0. \end{split} \end{align*}

That is, there exist distinct \( \tilde{P}, \tilde{Q} \in \mathcal{M}_{\tilde{\rho }}\) such that \( d_{\tilde{\rho }}(\tilde{P}, \tilde{Q}) = 0\), which contradicts the assumption that \(\mathrm{tw}S_{\rho }\) is strictly proper with respect to \( \mathcal{M}_{\tilde{\rho }}\). Hence, if the kernel score associated with \( \rho\) is strictly proper with respect to \( \mathcal{M}_{\rho }\), then the kernel score associated with \( \tilde{\rho }\) is strictly proper with respect to \(\mathcal{M}_{\tilde{\rho }}\) if and only if the chaining function \( v\) is injective. Similar arguments can be used in the case when \( S_{\rho }\) is strictly proper with respect to \( \mathcal{M}^{\rho }\).

## Appendix E. Proof of Proposition 4.4.

Let \( \rho\) be a c.n.d. kernel on \( \mathcal{X}\) such that \( \rho (x, x) = 0\) for all \( x \in \mathcal{X}\), let \( w\) be a weight function, and let \( v: \mathcal{X} \to \mathcal{X}\) be measurable. Set \( \tilde{\rho }(x, x^{\prime }) = \rho (v(x), v(x^{\prime }))\). For \( P \in \mathcal{M}_{\tilde{\rho }}\), define \(P_{0} = P(\cdot \cap \{w = 0\})\) and \(P_{+} = P(\cdot \cap \{w \gt 0\})\), where \( \{w \gt 0\} = \{x \in \mathcal{X} | w(x) \gt 0\}\) and \( \{w = 0\} = \{x \in \mathcal{X} | w(x) = 0\}\). Since \( \mathcal{X}\) can be partitioned into the union of \( \{ w \gt 0 \}\) and \( \{ w = 0 \}\), the threshold-weighted kernel score with kernel \( \rho\) and chaining function \( v\) can be decomposed as

\begin{align} \begin{split} \mathrm{tw}S_{\rho }(P, y; v) = & \int_{\mathcal{X}} \rho (v(x), v(y)) \: \mathrm{d} P_{0}(x) - \frac{1}{2} \int_{\mathcal{X}} \int_{\mathcal{X}} \rho (v(x), v(x^{\prime })) \: \mathrm{d} P_{0}(x) \: \mathrm{d} P_{0}(x^{\prime }) \\ & + \int_{\mathcal{X}} \rho (v(x), v(y)) \: \mathrm{d} P_{+}(x) - \frac{1}{2} \int_{\mathcal{X}} \int_{\mathcal{X}} \rho (v(x), v(x^{\prime })) \: \mathrm{d} P_{+}(x) \: \mathrm{d} P_{+}(x^{\prime }) \\ & - \int_{\mathcal{X}} \int_{\mathcal{X}} \rho (v(x), v(x^{\prime })) \: \mathrm{d} P_{0}(x) \: \mathrm{d} P_{+}(x^{\prime }). \end{split} \end{align}

(E.1)

Suppose that \( \rho (v(z), v(\cdot )) = \rho (v(z^{\prime }), v(\cdot ))\) for all \( z, z^{\prime } \in \{w = 0\}\). This implies that \( \rho (v(z), v(z^{\prime })) = 0\) for all \( z, z^{\prime } \in \{w = 0\}\). In this case, \( \mathrm{tw}S_{\rho }(P, y; v)\) simplifies to

\begin{align*} \begin{split} \mathrm{tw}S_{\rho }(P, y; v) = & \rho (v(x_{0}), v(y)) P(\{w = 0\}) + \int_{\mathcal{X}} \rho (v(x), v(y)) \: \mathrm{d} P_{+}(x) \\ & - \frac{1}{2} \int_{\mathcal{X}} \int_{\mathcal{X}} \rho (v(x), v(x^{\prime })) \: \mathrm{d} P_{+}(x) \: \mathrm{d} P_{+}(x^{\prime }) - P(\{w = 0\}) \int_{\mathcal{X}} \rho (v(x), v(x_{0})) \: \mathrm{d} P_{+}(x), \end{split} \end{align*}

where \( x_{0}\) is an arbitrary point in \( \{w = 0\}\). Since any \( P, Q \in \mathcal{M}_{\tilde{\rho }}\) are probability measures, if \( P_{+} = Q_{+}\), then \(P(\{w = 0\}) = Q(\{w = 0\}),\) and it becomes clear that \( \mathrm{tw}S_{\rho }(P, y; v) = \mathrm{tw}S_{\rho }(Q, y; v)\) for all \( y \in \mathcal{X}\). Hence, the threshold-weighted kernel score with such a kernel and chaining function is localizing with respect to \( w\).

Suppose now that \( \mathrm{tw}S_{\rho }\) is localizing with respect to \( w\). Consider \( P = \delta_{z}/2 + \delta_{x}/2\) and \(Q = \delta_{z^{\prime }}/2 + \delta_{x}/2\), where \( x \in \{w \gt 0 \}\) and \( z, z^{\prime } \in \{w = 0\}\). Note that \( P, Q \in \mathcal{M}_{\tilde{\rho }}\) and \( P(\cdot \cap \{w \gt 0\}) = Q(\cdot \cap \{w \gt 0 \})\). Since \( \mathrm{tw}S_{\rho }\) is localizing with respect to \( w\), we have that \( \mathrm{tw}S_{\rho }(P, y; v) - \mathrm{tw}S_{\rho }(Q, y; v) = 0\) for all \( y \in \mathcal{X}\). From (E.1), this means that

\begin{align} \mathrm{tw}S_{\rho }(P, y; v) - \mathrm{tw}S_{\rho }(Q, y; v) & = \frac{1}{2} \rho (v(z), v(y)) - \frac{1}{4} \rho (v(z), v(x)) - \frac{1}{2} \rho (v(z^{\prime }), v(y)) \\ \nonumber &\quad + \frac{1}{4} \rho (v(z^{\prime }), v(x)) = 0 \end{align}

(E.2)

for all \( z, z^{\prime } \in \{w = 0\}\), \( x \in \{w \gt 0\}\), and \( y \in \mathcal{X}\).

Since this holds for all \( y \in \mathcal{X}\), we can substitute \( y = x\) to yield \( \rho (v(z), v(x)) = \rho (v(z^{\prime }), v(x))\) for all \( x \in \{w \gt 0\}\), \( z, z^{\prime } \in \{w = 0\}\). Using this, and substituting \( y = z^{\prime }\) into (E.2), we additionally have \( \rho (v(z), v(z^{\prime })) = 0\) for all \( z, z^{\prime } \in \{w = 0 \}\). Hence, if \( \mathrm{tw}S_{\rho }\) is localizing with respect to \( w\), then \( \rho (v(z), v(x)) = \rho (v(z^{\prime }), v(x))\) for all \( z, z^{\prime } \in \{w = 0 \}\) and \( x \in \mathcal{X}\).

## Appendix F. Proof of Proposition 4.5.

Let \( \rho\) be a c.n.d. kernel, let \( v: \mathcal{X} \to \mathcal{X}\) be measurable, and define \( \tilde{\rho }(x, x^{\prime }) = \rho (v(x), v(x^{\prime }))\). Suppose that \( S_{\rho }\) is strictly proper with respect to \( \mathcal{M}_{\rho }\). First, assume that the restriction of \( v\) to \( \{w \gt 0\}\) is injective; that is, \( v(z) \neq v(z^{\prime })\) for all distinct \( z, z^{\prime } \in \{w \gt 0\}\). Let \( \tilde{P}, \tilde{Q} \in \mathcal{M}_{\tilde{\rho }}\). We wish to show that \( d_{\tilde{\rho }}(\tilde{P}, \tilde{Q}) = 0\) implies \( \tilde{P}(\cdot \cap \{w \gt 0\}) = \tilde{Q}(\cdot \cap \{w \gt 0\})\). Define \( P\) and \( Q\) as the push-forward measures of \( \tilde{P}\) and \( \tilde{Q}\) under \( v\), respectively, i.e., \( P(A) = \tilde{P}(v^{-1}(A))\) and \( Q(A) = \tilde{Q}(v^{-1}(A))\) for all \( A \in \mathcal{A}\). From the proof of Proposition 4.3, we know that \( P, Q \in \mathcal{M}_{\rho }\). Also from this proof, we have that \( d_{\tilde{\rho }}(\tilde{P}, \tilde{Q}) = 0\) implies that \( P = Q\). Then, for any \( A \in \mathcal{A}\), we have that

\begin{align*} \tilde{P}(A \cap \{w \gt 0\}) &= \tilde{P}(v^{-1}(v(A \cap \{w \gt 0\}))) = P(B) = Q(B) = \tilde{Q}(v^{-1}(v(A \cap \{w \gt 0\}))) \\ &= \tilde{Q}(A \cap \{w \gt 0\}), \end{align*}

where \( B = v(A \cap \{w \gt 0\}) \in \mathcal{A}\), and \( v^{-1}(v(A \cap \{w \gt 0\})) = A \cap \{w \gt 0\}\) for all \( A \in \mathcal{A}\) since the restriction of \( v\) to \( \{ w \gt 0\}\) is injective. Hence, \( d_{\tilde{\rho }}(\tilde{P}, \tilde{Q}) = 0\) implies that \( \tilde{P}(\cdot \cap \{w \gt 0\}) = \tilde{Q}(\cdot \cap \{w \gt 0\})\). The threshold-weighted kernel score with this chaining function is therefore strictly locally proper with respect to \( w\) and \( \mathcal{M}_{\tilde{\rho }}\).

Conversely, assume that \( \mathrm{tw}S_{\rho }\) is strictly locally proper with respect to \( w\) and \( \mathcal{M}_{\tilde{\rho }}\). If the restriction of \( v\) to \( \{w \gt 0\}\) is not injective, then there exist distinct \( z, z^{\prime } \in \{w \gt 0\}\) such that \( v(z) = v(z^{\prime })\). If \( \tilde{P} = \delta_{z}\) and \( \tilde{Q} = \delta_{z^{\prime }}\) are Dirac measures at \( z\) and \( z^{\prime }\), respectively, then \( d_{\tilde{\rho }}(\tilde{P}, \tilde{Q}) = 0\), even though \( \tilde{P}(\cdot \cap \{w \gt 0\}) \neq \tilde{Q}(\cdot \cap \{w \gt 0 \})\). This is a contradiction, and hence if \( \mathrm{tw}S_{\rho }\) is strictly locally proper with respect to \( w\) and \( \mathcal{M}_{\tilde{\rho }}\), then the restriction of \( v\) to \( \{ w \gt 0 \}\) must be injective. Similar arguments can be used in the case when \( \mathrm{tw}S_{\rho }\) is strictly proper with respect to \( \mathcal{M}^{\tilde{\rho }}\).

## Appendix G. Proof of Propositions 4.8 and 4.9.

Let \(-\rho\) be integrally strictly positive definite with respect to the maximal possible set of signed measures on \(\{w \gt 0\}\) in the sense of [38, Definition 2.1], let \( w\gt 0\) be a weight function, and let \( \check{\rho }(x, x^{\prime }) = \rho (x, x^{\prime })w(x)w(x^{\prime })\).

Let \(P, Q \in \mathcal{M}^{\rho },\) and let \(\tilde P, \tilde Q\) be the measures on \(\{w \gt 0\}\) that are absolutely continuous with respect to \(P(\cdot \cap \{w \gt 0\})\) and \(Q(\cdot \cap \{w \gt 0\})\), respectively, with density \(w\). Note that \(\tilde P, \tilde Q\) are finite measures on \(\{w \gt 0\}\) but not necessarily probability measures. For the divergence function corresponding to \( \mathrm{vr}S_{\rho }(\cdot, \cdot ; w)\), we obtain that

\begin{align*} 0= d_{\check{\rho }}(P, Q) &= \int_{\{w \gt 0\}} \int_{\{w \gt 0\}} \rho (x,y) \: \mathrm{d} \tilde P(x)\: \mathrm{d}\tilde Q(y) - \frac{1}{2} \int_{\{w \gt 0\}} \int_{\{w \gt 0\}} \rho (x,x^{\prime}) \: \mathrm{d} \tilde P(x)\: \mathrm{d}\tilde P(x^{\prime}) \\& \qquad - \frac{1}{2} \int_{\{w \gt 0\}} \int_{\{w \gt 0\}} \rho (y,y^{\prime}) \: \mathrm{d} \tilde Q(y)\: \mathrm{d}\tilde Q(y^{\prime}) \end{align*}

implies \(\tilde P = \tilde Q\), since \(-\rho\) is integrally strictly positive definite on \(\{w \gt 0\}\). This yields \(P(\cdot \cap \{w \gt 0\}) = Q(\cdot \cap \{w \gt 0\})\).

## Appendix H. Proof of Proposition 4.10.

Let \( \rho\) be a c.n.d. kernel on \( \mathcal{X}\) with \( \rho (x, x) = 0\) for all \( x \in \mathcal{X}\), and let \( w\) be a weight function such that \( w(x) \in \{0, 1\}\) for all \( x \in \mathcal{X}\). Consider the chaining function \(v(x) = xw(x) + x_{0}(1 - w(x))\) for \( x, x_{0} \in \mathcal{X}\).

For this chaining function, we have that

\begin{align*} \rho (v(x), v(x^{\prime })) = \rho (x, x^{\prime })w(x)w(x^{\prime }) + \rho (x, x_{0})w(x)(1 - w(x^{\prime })) + \rho (x^{\prime }, x_{0})w(x^{\prime })(1 - w(x)). \end{align*}

Let \( P \in \mathcal{M}_{\rho }\). Assuming all expectations are finite, substituting the above kernel into (4.1) gives

\begin{align*} \begin{split} \mathrm{tw}S_{\rho }(P, y; v) &= \mathbb{E}_{P} \left [ \rho (X, y) w(X)w(y) \right ] - \frac{1}{2} \mathbb{E}_{P} \left [\rho (X, X^{\prime }) w(X) w(X^{\prime }) \right ] \\ &\quad + ( \mathbb{E}_{P} \left [ \rho (X, x_{0})w(X) \right ] - \rho (y, x_{0})w(y) )(\mathbb{E}_{P}[w(X)] - w(y)), \end{split} \end{align*}

which is the vertically rescaled kernel score with kernel \( \rho\), weight \( w\), and center \( x_{0}\), as given in (4.4).
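The identity established in this proof can be checked numerically for an ensemble forecast: with a binary weight, the transformed kernel score obtained by chaining the ensemble members and the observation through \(v\) coincides with the vertically rescaled kernel score. The Python sketch below is illustrative and not from the paper; it assumes the kernel \(\rho(x, x^{\prime}) = |x - x^{\prime}|\) underlying the CRPS, a threshold weight \(w(x) = \mathbb{1}\{x > 1\}\), and center \(x_{0} = 1\), and compares the two formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (assumptions for this example): rho is the CRPS kernel,
# which is c.n.d. with rho(x, x) = 0; w is a binary threshold weight; x0 is
# the center used by the chaining function.
rho = lambda a, b: np.abs(a - b)
w = lambda x: (np.asarray(x) > 1.0).astype(float)
x0 = 1.0

def tw_score(X, y):
    """Transformed kernel score: chain ensemble and observation through v."""
    v = lambda x: x * w(x) + x0 * (1.0 - w(x))
    vX, vy = v(X), v(y)
    return rho(vX, vy).mean() - 0.5 * rho(vX[:, None], vX[None, :]).mean()

def vr_score(X, y):
    """Vertically rescaled kernel score with kernel rho, weight w, center x0."""
    wX, wy = w(X), w(y)
    t1 = (rho(X, y) * wX * wy).mean()
    t2 = 0.5 * (rho(X[:, None], X[None, :]) * wX[:, None] * wX[None, :]).mean()
    t3 = ((rho(X, x0) * wX).mean() - rho(y, x0) * wy) * (wX.mean() - wy)
    return t1 - t2 + t3

X = rng.normal(size=2000)  # ensemble members drawn from the forecast P
for y in (0.3, 1.7, 2.5):  # observations below and above the threshold
    assert np.isclose(tw_score(X, y), vr_score(X, y))
```

Because the identity holds for any distribution \(P\), it holds in particular for the empirical distribution of the ensemble, so the two sample-based formulas agree up to floating-point rounding.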

## Acknowledgments.

The authors are extremely grateful to Olivia Martius, Pascal Horton, Jonas Bhend, Lionel Moret, Mark Liniger, and José Carlos Araujo Acuña for fruitful discussions during the preparation of this manuscript. The authors also thank MeteoSwiss and Jonas Bhend for providing the data used in the case study.

## Supplementary Materials

**PLEASE NOTE: These supplementary files have not been peer-reviewed.**

**Index of Supplementary Materials**

**Title of paper:**

*Evaluating Forecasts for High-Impact Events Using Transformed Kernel Scores*

**Authors:**

*Sam Allen, David Ginsbourger, and Johanna Ziegel*

**File:** supplementary_material.pdf

**Type:** PDF

**Contents:** A thorough description of the statistical post-processing methods that are implemented in the case study of the paper, as well as an assessment of the calibration of the competing forecasts. Although this information supports the discussion in the paper, it is not the paper's key focus, and it has thus been relegated to supplementary material for concision.


## References

1. S. Allen, J. Bhend, O. Martius, and J. Ziegel, Weighted verification tools to evaluate univariate and multivariate probabilistic forecasts for high-impact weather events, *Wea. Forecasting*, 38 (2023), pp. 499–516.
2. S. Allen, G. R. Evans, P. Buchanan, and F. Kwasniok, Incorporating the North Atlantic Oscillation into the post-processing of MOGREPS-G wind speed forecasts, *Q. J. R. Meteorol. Soc.*, 147 (2021), pp. 1403–1418.
3. J. Beck and S. Guillas, Sequential design with mutual information for computer experiments (MICE): Emulation of a tsunami model, *SIAM/ASA J. Uncertain. Quantif.*, 4 (2016), pp. 739–766, https://doi.org/10.1137/140989613.
4. C. Berg, J. P. R. Christensen, and P. Ressel, *Harmonic Analysis on Semigroups*, Springer, New York, 1984.
5. J. R. Brehmer and K. Strokorb, Why scoring functions cannot assess tail properties, *Electron. J. Stat.*, 13 (2019), pp. 4015–4034.
6. G. W. Brier, Verification of forecasts expressed in terms of probability, *Mon. Weather Rev.*, 78 (1950), pp. 1–3.
7. W. Chang, M. Haran, P. Applegate, and D. Pollard, Calibrating an ice sheet model using high-dimensional binary spatial data, *J. Amer. Stat. Assoc.*, 111 (2016), pp. 57–72.
8. J.-P. Chilès and P. Delfiner, *Geostatistics: Modeling Spatial Uncertainty*, 2nd ed., Wiley Ser. Probab. Statist. 497, John Wiley & Sons, New York, 2012.
9. A. P. Dawid, The geometry of proper scoring rules, *Ann. Inst. Stat. Math.*, 59 (2007), pp. 77–93.
10. F. X. Diebold and R. S. Mariano, Comparing predictive accuracy, *J. Bus. Econ. Stat.*, 13 (1995), pp. 253–263.
11. C. Diks, V. Panchenko, and D. Van Dijk, Likelihood-based scoring rules for comparing density forecasts in tails, *J. Econom.*, 163 (2011), pp. 215–230.
12. G. K. Dziugaite, D. M. Roy, and Z. Ghahramani, Training generative neural networks via maximum mean discrepancy optimization, in Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, UAI 2015, M. Meila and T. Heskes, eds., AUAI Press, Amsterdam, The Netherlands, 2015, pp. 258–267.
13. T. Gneiting, Making and evaluating point forecasts, *J. Amer. Stat. Assoc.*, 106 (2011), pp. 746–762.
14. T. Gneiting, F. Balabdaoui, and A. E. Raftery, Probabilistic forecasts, calibration and sharpness, *J. R. Stat. Soc. Ser. B*, 69 (2007), pp. 243–268.
15. T. Gneiting and M. Katzfuss, Probabilistic forecasting, *Annu. Rev. Stat. Appl.*, 1 (2014), pp. 125–151.
16. T. Gneiting and A. E. Raftery, Strictly proper scoring rules, prediction, and estimation, *J. Amer. Stat. Assoc.*, 102 (2007), pp. 359–378.
17. T. Gneiting and R. Ranjan, Comparing density forecasts using threshold- and quantile-weighted scoring rules, *J. Bus. Econ. Stat.*, 29 (2011), pp. 411–422.
18. A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola, A kernel statistical test of independence, in *Advances in Neural Information Processing Systems 20*, MIT Press, Cambridge, MA, 2007, pp. 585–592.
19. H. Holzmann and B. Klar, Focusing on regions of interest in forecast evaluation, *Ann. Appl. Stat.*, 11 (2017), pp. 2404–2431.
20. T. Hothorn, F. Bretz, and A. Genz, On multivariate \(t\) and Gauss probabilities in R, *R News*, 1 (2001), pp. 27–29.
21. I. T. Jolliffe and D. B. Stephenson, *Forecast Verification: A Practitioner’s Guide in Atmospheric Science*, John Wiley & Sons, Chichester, 2012.
22. A. Jordan, F. Krüger, and S. Lerch, Evaluating probabilistic forecasts with scoringRules, *J. Stat. Softw.*, 90 (2019), pp. 1–37.
23. F. Laio and S. Tamea, Verification tools for probabilistic forecasts of continuous hydrological variables, *Hydrol. Earth Syst. Sci.*, 11 (2007), pp. 1267–1277.
24. S. Lerch, T. L. Thorarinsdottir, F. Ravazzolo, and T. Gneiting, Forecaster’s dilemma: Extreme events and forecast evaluation, *Stat. Sci.*, 32 (2017), pp. 106–127.
25. M. Leutbecher and T. N. Palmer, Ensemble forecasting, *J. Comput. Phys.*, 227 (2008), pp. 3515–3539.
26. Y. Li, K. Swersky, and R. Zemel, Generative moment matching networks, in Proceedings of the 32nd International Conference on Machine Learning (Lille, France), F. Bach and D. Blei, eds., PMLR 37, 2015, pp. 1718–1727.
27. R. Lyons, Distance covariance in metric spaces, *Ann. Probab.*, 41 (2013), pp. 3284–3305.
28. J. E. Matheson and R. L. Winkler, Scoring rules for continuous probability distributions, *Manage. Sci.*, 22 (1976), pp. 1087–1096.
29. C. A. Micchelli, Interpolation of scattered data: Distance matrices and conditionally positive definite functions, in *Approximation Theory and Spline Functions*, Springer, New York, 1984, pp. 143–145.
30. K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf, Kernel mean embedding of distributions: A review and beyond, *Found. Trends Mach. Learn.*, 10 (2017), pp. 1–141.
31. P. Pinson and J. Tastu, *Discrimination Ability of the Energy Score*, Technical report, Technical University of Denmark, 2013.
32. C. E. Rasmussen and C. Williams, *Gaussian Processes for Machine Learning*, MIT Press, Cambridge, MA, 2006.
33. S. Sankaran, H. J. Kim, G. Choi, and C. A. Taylor, Uncertainty quantification in coronary blood flow simulations: Impact of geometry, boundary conditions and blood viscosity, *J. Biomech.*, 49 (2016), pp. 2540–2547.
34. R. Schefzik, T. L. Thorarinsdottir, and T. Gneiting, Uncertainty quantification in complex simulation models using ensemble copula coupling, *Stat. Sci.*, 28 (2013), pp. 616–640.
35. M. Scheuerer and T. M. Hamill, Variogram-based proper scoring rules for probabilistic forecasts of multivariate quantities, *Mon. Weather Rev.*, 143 (2015), pp. 1321–1334.
36. B. Schölkopf and A. J. Smola, *Learning with Kernels*, MIT Press, Cambridge, MA, 2002.
37. D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu, Equivalence of distance-based and RKHS-based statistics in hypothesis testing, *Ann. Stat.*, 41 (2013), pp. 2263–2291.
38. I. Steinwart and J. F. Ziegel, Strictly proper kernel scores and characteristic kernels on compact spaces, *Appl. Comput. Harmon. Anal.*, 51 (2021), pp. 510–542.
39. G. J. Székely and M. L. Rizzo, Energy statistics: A class of statistics based on distances, *J. Stat. Plan. Infer.*, 143 (2013), pp. 1249–1272.
40. M. Taillardat, A.-L. Fougères, P. Naveau, and R. de Fondeville, *Extreme Events Evaluation Using CRPS Distributions*, preprint, arXiv:1905.04022, 2019.
41. T. L. Thorarinsdottir, T. Gneiting, and N. Gissibl, Using proper divergence functions to evaluate climate models, *SIAM/ASA J. Uncertain. Quantif.*, 1 (2013), pp. 522–534, https://doi.org/10.1137/130907550.
42. J. Zscheischler, O. Martius, S. Westra, E. Bevacqua, C. Raymond, R. M. Horton, B. van den Hurk, et al., A typology of compound weather and climate events, *Nat. Rev. Earth Environ.*, 1 (2020), pp. 333–347.

## Information & Authors

### Information

#### Published In

SIAM/ASA Journal on Uncertainty Quantification

Pages: 906 - 940

DOI: 10.1137/22M1532184

ISSN (online): 2166-2525

#### Copyright

© 2023 Society for Industrial and Applied Mathematics and American Statistical Association.

#### History

**Submitted**: 4 November 2022

**Accepted**: 31 March 2023

**Published online**: 17 August 2023

### Authors

#### Funding Information

**Funding:** This work was funded by the Swiss Federal Office of Meteorology and Climatology (MeteoSwiss) and the Oeschger Centre for Climate Change Research.
