7. Supplementary Discussion#

\(\require{mathtools} \newcommand{\notag}{} \newcommand{\tag}{} \newcommand{\label}[1]{} \newcommand{\sfrac}[2]{#1/#2} \newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\num}[1]{#1} \newcommand{\qty}[2]{#1\,#2} \renewenvironment{align} {\begin{aligned}} {\end{aligned}} \renewenvironment{alignat} {\begin{alignedat}} {\end{alignedat}} \newcommand{\pdfmspace}[1]{} % Ignore PDF-only spacing commands \newcommand{\htmlmspace}[1]{\mspace{#1}} % Ignore PDF-only spacing commands \newcommand{\scaleto}[2]{#1} % Allow to use scaleto from scalerel package \newcommand{\RR}{\mathbb R} \newcommand{\NN}{\mathbb N} \newcommand{\PP}{\mathbb P} \newcommand{\EE}{\mathbb E} \newcommand{\XX}{\mathbb X} \newcommand{\ZZ}{\mathbb Z} \newcommand{\QQ}{\mathbb Q} \newcommand{\fF}{\mathcal F} \newcommand{\dD}{\mathcal D} \newcommand{\lL}{\mathcal L} \newcommand{\gG}{\mathcal G} \newcommand{\hH}{\mathcal H} \newcommand{\nN}{\mathcal N} \newcommand{\pP}{\mathcal P} \newcommand{\BB}{\mathbb B} \newcommand{\Exp}{\operatorname{Exp}} \newcommand{\Binomial}{\operatorname{Binomial}} \newcommand{\Poisson}{\operatorname{Poisson}} \newcommand{\linop}{\mathcal{L}(\mathbb{B})} \newcommand{\linopell}{\mathcal{L}(\ell_1)} \DeclareMathOperator{\trace}{trace} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Span}{span} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\col}{col} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\gt}{>} \definecolor{highlight-blue}{RGB}{0,123,255} % definition, theorem, proposition \definecolor{highlight-yellow}{RGB}{255,193,7} % lemma, conjecture, example \definecolor{highlight-orange}{RGB}{253,126,20} % criterion, corollary, property \definecolor{highlight-red}{RGB}{220,53,69} % criterion \newcommand{\logL}{\ell} \newcommand{\eE}{\mathcal{E}} \newcommand{\oO}{\mathcal{O}} \newcommand{\defeq}{\stackrel{\mathrm{def}}{=}} \newcommand{\Bspec}{\mathcal{B}} % Spectral radiance \newcommand{\X}{\mathcal{X}} % X space 
\newcommand{\Y}{\mathcal{Y}} % Y space \newcommand{\M}{\mathcal{M}} % Model \newcommand{\Tspace}{\mathcal{T}} \newcommand{\Vspace}{\mathcal{V}} \newcommand{\Mtrue}{\mathcal{M}_{\mathrm{true}}} \newcommand{\MP}{\M_{\mathrm{P}}} \newcommand{\MRJ}{\M_{\mathrm{RJ}}} \newcommand{\qproc}{\mathfrak{Q}} \newcommand{\D}{\mathcal{D}} % Data (true or generic) \newcommand{\Dt}{\tilde{\mathcal{D}}} \newcommand{\Phit}{\widetilde{\Phi}} \newcommand{\Phis}{\Phi^*} \newcommand{\qt}{\tilde{q}} \newcommand{\qs}{q^*} \newcommand{\qh}{\hat{q}} \newcommand{\AB}[1]{\mathtt{AB}~\mathtt{#1}} \newcommand{\LP}[1]{\mathtt{LP}~\mathtt{#1}} \newcommand{\NML}{\mathrm{NML}} \newcommand{\iI}{\mathcal{I}} \newcommand{\true}{\mathrm{true}} \newcommand{\dist}{D} \newcommand{\Mtheo}[1]{\mathcal{M}_{#1}} % Model (theoretical model); index: param set \newcommand{\DL}[1][L]{\mathcal{D}^{(#1)}} % Data (RV or generic) \newcommand{\DLp}[1][L]{\mathcal{D}^{(#1')}} % Data (RV or generic) \newcommand{\DtL}[1][L]{\tilde{\mathcal{D}}^{(#1)}} % Data (RV or generic) \newcommand{\DpL}[1][L]{{\mathcal{D}'}^{(#1)}} % Data (RV or generic) \newcommand{\Dobs}[1][]{\mathcal{D}_{\mathrm{obs}}^{#1}} % Data (observed) \newcommand{\calibset}{\mathcal{C}} \newcommand{\N}{\mathcal{N}} % Normal distribution \newcommand{\Z}{\mathcal{Z}} % Partition function \newcommand{\VV}{\mathbb{V}} % Variance \newcommand{\T}{\mathsf{T}} % Transpose \newcommand{\EMD}{\mathrm{EMD}} \newcommand{\dEMD}{d_{\mathrm{EMD}}} \newcommand{\dEMDtilde}{\tilde{d}_{\mathrm{EMD}}} \newcommand{\dEMDsafe}{d_{\mathrm{EMD}}^{\text{(safe)}}} \newcommand{\e}{ε} % Model confusion threshold \newcommand{\falsifythreshold}{ε} \newcommand{\bayes}[1][]{B_{#1}} \newcommand{\bayesthresh}[1][]{B_{0}} \newcommand{\bayesm}[1][]{B^{\mathcal{M}}_{#1}} \newcommand{\bayesl}[1][]{B^l_{#1}} \newcommand{\bayesphys}[1][]{B^{{p}}_{#1}} \newcommand{\Bconf}[1]{B^{\mathrm{epis}}_{#1}} \newcommand{\Bemd}[1]{B^{\mathrm{EMD}}_{#1}} \newcommand{\BQ}[1]{B^{Q}_{#1}} 
\newcommand{\Bconfbin}[1][]{\bar{B}^{\mathrm{conf}}_{#1}} \newcommand{\Bemdbin}[1][]{\bar{B}_{#1}^{\mathrm{EMD}}} \newcommand{\bin}{\mathcal{B}} \newcommand{\Bconft}[1][]{\tilde{B}^{\mathrm{conf}}_{#1}} \newcommand{\fc}{f_c} \newcommand{\fcbin}{\bar{f}_c} \newcommand{\paramphys}[1][]{Θ^{{p}}_{#1}} \newcommand{\paramobs}[1][]{Θ^{ε}_{#1}} \newcommand{\test}{\mathrm{test}} \newcommand{\train}{\mathrm{train}} \newcommand{\synth}{\mathrm{synth}} \newcommand{\rep}{\mathrm{rep}} \newcommand{\MNtrue}{\mathcal{M}^{{p}}_{\text{true}}} \newcommand{\MN}[1][]{\mathcal{M}^{{p}}_{#1}} \newcommand{\MNA}{\mathcal{M}^{{p}}_{Θ_A}} \newcommand{\MNB}{\mathcal{M}^{{p}}_{Θ_B}} \newcommand{\Me}[1][]{\mathcal{M}^ε_{#1}} \newcommand{\Metrue}{\mathcal{M}^ε_{\text{true}}} \newcommand{\Meobs}{\mathcal{M}^ε_{\text{obs}}} \newcommand{\Meh}[1][]{\hat{\mathcal{M}}^ε_{#1}} \newcommand{\MNa}{\mathcal{M}^{\mathcal{N}}_a} \newcommand{\MeA}{\mathcal{M}^ε_A} \newcommand{\MeB}{\mathcal{M}^ε_B} \newcommand{\Ms}{\mathcal{M}^*} \newcommand{\MsA}{\mathcal{M}^*_A} \newcommand{\MsB}{\mathcal{M}^*_B} \newcommand{\Msa}{\mathcal{M}^*_a} \newcommand{\MsAz}{\mathcal{M}^*_{A,z}} \newcommand{\MsBz}{\mathcal{M}^*_{B,z}} \newcommand{\Msaz}{\mathcal{M}^*_{a,z}} \newcommand{\MeAz}{\mathcal{M}^ε_{A,z}} \newcommand{\MeBz}{\mathcal{M}^ε_{B,z}} \newcommand{\Meaz}{\mathcal{M}^ε_{a,z}} \newcommand{\zo}{z^{0}} \renewcommand{\lL}[2][]{\mathcal{L}_{#1|{#2}}} % likelihood \newcommand{\Lavg}[2][]{\mathcal{L}^{/#2}_{#1}} % Geometric average of likelihood \newcommand{\lLphys}[2][]{\mathcal{L}^{{p}}_{#1|#2}} \newcommand{\Lavgphys}[2][]{\mathcal{L}^{{p}/#2}_{#1}} % Geometric average of likelihood \newcommand{\lLL}[3][]{\mathcal{L}^{(#3)}_{#1|#2}} \newcommand{\lLphysL}[3][]{\mathcal{L}^{{p},(#3)}_{#1|#2}} \newcommand{\lnL}[2][]{l_{#1|#2}} % Per-sample log likelihood \newcommand{\lnLt}[2][]{\widetilde{l}_{#1|#2}} \newcommand{\lnLtt}{\widetilde{l}} % Used only in path_sampling \newcommand{\lnLh}[1][]{\hat{l}_{#1}} 
\newcommand{\lnLphys}[2][]{l^{{p}}_{#1|#2}} \newcommand{\lnLphysL}[3][]{l^{{p},(#3)}_{#1|#2}} \newcommand{\Elmu}[2][1]{μ_{{#2}}^{(#1)}} \newcommand{\Elmuh}[2][1]{\hat{μ}_{{#2}}^{(#1)}} \newcommand{\Elsig}[2][1]{Σ_{{#2}}^{(#1)}} \newcommand{\Elsigh}[2][1]{\hat{Σ}_{{#2}}^{(#1)}} \newcommand{\pathP}{\mathop{{p}}} % Path-sampling process (generic) \newcommand{\pathPhb}{\mathop{{p}}_{\mathrm{Beta}}} % Path-sampling process (hierarchical beta) \newcommand{\interval}{\mathcal{I}} \newcommand{\Phiset}[1]{\{\Phi\}^{\small (#1)}} \newcommand{\Phipart}[1]{\{\mathcal{I}_Φ\}^{\small (#1)}} \newcommand{\qhset}[1]{\{\qh\}^{\small (#1)}} \newcommand{\Dqpart}[1]{\{Δ\qh_{2^{#1}}\}} \newcommand{\LsAzl}{\mathcal{L}_{\smash{{}^{\,*}_A},z,L}} \newcommand{\LsBzl}{\mathcal{L}_{\smash{{}^{\,*}_B},z,L}} \newcommand{\lsA}{l_{\smash{{}^{\,*}_A}}} \newcommand{\lsB}{l_{\smash{{}^{\,*}_B}}} \newcommand{\lsAz}{l_{\smash{{}^{\,*}_A},z}} \newcommand{\lsAzj}{l_{\smash{{}^{\,*}_A},z_j}} \newcommand{\lsAzo}{l_{\smash{{}^{\,*}_A},z^0}} \newcommand{\leAz}{l_{\smash{{}^{\,ε}_A},z}} \newcommand{\lsAez}{l_{\smash{{}^{*ε}_A},z}} \newcommand{\lsBz}{l_{\smash{{}^{\,*}_B},z}} \newcommand{\lsBzj}{l_{\smash{{}^{\,*}_B},z_j}} \newcommand{\lsBzo}{l_{\smash{{}^{\,*}_B},z^0}} \newcommand{\leBz}{l_{\smash{{}^{\,ε}_B},z}} \newcommand{\lsBez}{l_{\smash{{}^{*ε}_B},z}} \newcommand{\LaszL}{\mathcal{L}_{\smash{{}^{*}_a},z,L}} \newcommand{\lasz}{l_{\smash{{}^{*}_a},z}} \newcommand{\laszj}{l_{\smash{{}^{*}_a},z_j}} \newcommand{\laszo}{l_{\smash{{}^{*}_a},z^0}} \newcommand{\laez}{l_{\smash{{}^{ε}_a},z}} \newcommand{\lasez}{l_{\smash{{}^{*ε}_a},z}} \newcommand{\lhatasz}{\hat{l}_{\smash{{}^{*}_a},z}} \newcommand{\pasz}{p_{\smash{{}^{*}_a},z}} \newcommand{\paez}{p_{\smash{{}^{ε}_a},z}} \newcommand{\pasez}{p_{\smash{{}^{*ε}_a},z}} \newcommand{\phatsaz}{\hat{p}_{\smash{{}^{*}_a},z}} \newcommand{\phateaz}{\hat{p}_{\smash{{}^{ε}_a},z}} \newcommand{\phatseaz}{\hat{p}_{\smash{{}^{*ε}_a},z}} \newcommand{\Phil}[2][]{Φ_{#1|#2}} % Φ_{\la} 
\newcommand{\Philt}[2][]{\widetilde{Φ}_{#1|#2}} % Φ_{\la} \newcommand{\Philhat}[2][]{\hat{Φ}_{#1|#2}} % Φ_{\la} \newcommand{\Philsaz}{Φ_{\smash{{}^{*}_a},z}} % Φ_{\lasz} \newcommand{\Phileaz}{Φ_{\smash{{}^{ε}_a},z}} % Φ_{\laez} \newcommand{\Philseaz}{Φ_{\smash{{}^{*ε}_a},z}} % Φ_{\lasez} \newcommand{\mus}[1][1]{μ^{(#1)}_*} \newcommand{\musA}[1][1]{μ^{(#1)}_{\smash{{}^{\,*}_A}}} \newcommand{\SigsA}[1][1]{Σ^{(#1)}_{\smash{{}^{\,*}_A}}} \newcommand{\musB}[1][1]{μ^{(#1)}_{\smash{{}^{\,*}_B}}} \newcommand{\SigsB}[1][1]{Σ^{(#1)}_{\smash{{}^{\,*}_B}}} \newcommand{\musa}[1][1]{μ^{(#1)}_{\smash{{}^{*}_a}}} \newcommand{\Sigsa}[1][1]{Σ^{(#1)}_{\smash{{}^{*}_a}}} \newcommand{\Msah}{{\color{highlight-red}\mathcal{M}^{*}_a}} \newcommand{\Msazh}{{\color{highlight-red}\mathcal{M}^{*}_{a,z}}} \newcommand{\Meah}{{\color{highlight-blue}\mathcal{M}^{ε}_a}} \newcommand{\Meazh}{{\color{highlight-blue}\mathcal{M}^{ε}_{a,z}}} \newcommand{\lsazh}{{\color{highlight-red}l_{\smash{{}^{*}_a},z}}} \newcommand{\leazh}{{\color{highlight-blue}l_{\smash{{}^{ε}_a},z}}} \newcommand{\lseazh}{{\color{highlight-orange}l_{\smash{{}^{*ε}_a},z}}} \newcommand{\Philsazh}{{\color{highlight-red}Φ_{\smash{{}^{*}_a},z}}} % Φ_{\lasz} \newcommand{\Phileazh}{{\color{highlight-blue}Φ_{\smash{{}^{ε}_a},z}}} % Φ_{\laez} \newcommand{\Philseazh}{{\color{highlight-orange}Φ_{\smash{{}^{*ε}_a},z}}} % Φ_{\lasez} \newcommand{\emdstd}{\tilde{σ}} \DeclareMathOperator{\Mvar}{Mvar} \DeclareMathOperator{\AIC}{AIC} \DeclareMathOperator{\epll}{epll} \DeclareMathOperator{\elpd}{elpd} \DeclareMathOperator{\MDL}{MDL} \DeclareMathOperator{\comp}{COMP} \DeclareMathOperator{\Lognorm}{Lognorm} \DeclareMathOperator{\erf}{erf} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\Image}{Image} \DeclareMathOperator{\sgn}{sgn} \DeclareMathOperator{\SE}{SE} % standard error \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\SkewNormal}{SkewNormal} 
\DeclareMathOperator{\TruncNormal}{TruncNormal} \DeclareMathOperator{\Exponential}{Exponential} \DeclareMathOperator{\exGaussian}{exGaussian} \DeclareMathOperator{\IG}{IG} \DeclareMathOperator{\NIG}{NIG} \DeclareMathOperator{\Gammadist}{Gamma} \DeclareMathOperator{\Lognormal}{Lognormal} \DeclareMathOperator{\Beta}{Beta} \newcommand{\sinf}{{s_{\infty}}}\)

7.1. Other forms of uncertainty#

As we state in the main text, there are two main sources of epistemic uncertainty on an estimate of the risk: a limited number of samples and variability in the replication process. This work focusses on estimating the replication uncertainty, and on numerically studying its effect as a function of sample size. We have eschewed a formal treatment of sample-size effects, since this is a well-studied problem and good estimation procedures already exist.

For example, a bootstrap procedure can be used to estimate the uncertainty on a statistic (here the risk \(R\)) from a single dataset \(\D\), by recomputing the statistic on multiple surrogate datasets obtained by resampling \(\D\) with replacement (Fig. 7.1b). Alternatively, if we have access to good candidate models, we can use those models as simulators to generate multiple synthetic datasets (Fig. 7.1c). The distribution of risks over those datasets is then a direct estimate of the uncertainty on the risk due to finite samples.
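The case-resampling bootstrap described above can be sketched in a few lines. This is a generic illustration, not the code used for Fig. 7.1b; the function name and the toy loss values are our own, and only the resampling-with-replacement logic follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_risk(losses, n_boot=1200, rng=rng):
    """Case-resampling bootstrap of the risk R (the mean pointwise loss).

    `losses` holds the pointwise losses Q evaluated on one dataset of size L.
    Each surrogate dataset is obtained by resampling the L losses with
    replacement; returns the distribution of R over the surrogates.
    """
    L = len(losses)
    idx = rng.integers(0, L, size=(n_boot, L))  # resample indices with replacement
    return losses[idx].mean(axis=1)

# Toy example: the spread of the bootstrap R-distribution shrinks as L grows,
# consistent with the collapse described in the text for the infinite-data limit.
losses = rng.normal(1.0, 0.5, size=4000)
R_boot = bootstrap_risk(losses)
```

With \(L = 4000\) the standard deviation of `R_boot` is roughly \(0.5/\sqrt{4000}\); quadrupling \(L\) halves it, which is the finite-sample collapse this criterion is designed to separate from replication uncertainty.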

In the limit of infinite data, both of these methods produce a risk “distribution” which collapses onto a precise value, independent of any discrepancy between model and true data-generating process. In contrast, that discrepancy defines the spread of risk distributions in Fig. 2.2, which do not collapse when \(L \to \infty\).

Aleatoric uncertainty also does not vanish in the large \(L\) limit, but it manifests differently: it sets a lower bound on the spread of pointwise losses a model can achieve (Fig. 7.1a). In terms of our formalism, therefore:

  • aleatoric uncertainty determines the shape of the PPFs \(\qs\) and \(\qt\),

  • epistemic uncertainty (due to finite samples) is the statistical uncertainty on those shapes, and

  • epistemic uncertainty (across replicates) is the propensity of the PPF to change when the experiment is replicated.

Both forms of epistemic uncertainty can contribute uncertainty on the risk, i.e. increase the spread of the \(R\)-distribution, but in the \(L \to \infty\) limit only the effect of replications remains. Changes to aleatoric uncertainty will shift \(R\)-distributions along the \(R\) axis, but will not directly contribute epistemic uncertainty.

../../_images/prinz_aleatoric-not-equiv.svg

Fig. 7.1 Aleatoric and finite-size uncertainty. a) Loss distribution of individual data points — \(\{Q_a(t_k, V^{\LP{\!\!}}(t_k; \Mtrue))\}\) — for the dataset and models shown in Fig. 2.1 and \(a \in \{A,B,C,D\}\). For a model which predicts the data well, this is mostly determined by the aleatoric uncertainty. b) Bootstrap estimate of finite-size uncertainty on the risk, obtained using case resampling [77]: for each model, the set of losses was resampled 1200 times with replacement. Dataset and colours are the same as in (a); the same \(L\) data points are used for all models. c) Synthetic estimate of finite-size uncertainty on the risk, obtained by evaluating Eq. 2.7 on 400 different simulations of the candidate model (differing by the random seed). Here the same model is used for simulation and loss evaluation. Dataset sizes \(L\) determine the integration time, adjusted so all datasets contain the same number of spikes. All subpanels use the same vertical scale. b–c) The variance of the \(R\)-distributions, i.e. the uncertainty on \(R\), goes to zero as \(L\) is increased. a–c) Colours indicate the model used for the loss. Probability densities were obtained by a kernel density estimate (KDE). [source]#

7.2. Flexibility in selecting \(c\)#

An important property of our approach is that the sensitivity parameter \(c\) does not need to be tuned to a precise value, but can lie within a range: \(c \in [2^{-4}, 2^0]\) for the neuron models of Fig. 2.5, or \(c \in [2^{-6}, 2^3]\) for the Planck and Rayleigh-Jeans models of Fig. 4.1. This does not mean that the value of \(\Bemd{ab;c}\) itself is insensitive to \(c\): for fixed data, a larger \(c\) will generally bring \(\Bemd{ab;c}\) closer to 50%. But the probability bound given by \(\Bemd{ab;c}\) remains correct for all \(c\) within that range.

For example, we compute \(\Bemd{\mathrm{P},\mathrm{RJ};c}\) to be 80% when \(c=2^{-3}\), versus 60% when \(c=2^{0}\). [source] This means that if we fix \(c=2^{-3}\) (resp. \(c=2^{0}\)), among all experiments which yield \(\Bemd{\mathrm{P},\mathrm{RJ};c} = 80\%\) (resp. \(\Bemd{\mathrm{P},\mathrm{RJ};c} = 60\%\)), in at least 80% (resp. 60%) of them model \(\M_{\mathrm{P}}\) will have lower true risk than \(\M_{\mathrm{RJ}}\).

More generally, a \(\Bemd{ab;c} = B\) (with \(0.5 < B \leq 1\)) states that, of all the experiments where the calculation of \(\Bemd{ab;c}\) is equal to \(B\), at least a fraction \(B\) of those will have \(R_a < R_b\). In this respect the \(\Bemd{}\) exhibits some similarities with a confidence interval, in that it is interpreted in terms of replications under a fixed computational procedure: in the former case we have a fixed procedure for calculating \(\Bemd{ab;c}\) given \(c\), while in the latter case we have a fixed procedure for calculating the confidence interval given a confidence level. A key difference however is that with the \(\Bemd{}\), the interpretation requires conditioning on the outcome of the calculation.

Of course the range of valid \(c\) values will depend on the variety of epistemic distributions, and cannot be guaranteed. In general, the larger the variety, the more difficult one can expect it to be to find a \(c\) which is valid in all conditions. We can however anticipate a few strategies which might increase the range of valid \(c\) values and otherwise improve the ability of the \(\Bemd{}\) criterion to discriminate between models:

Increased rejection threshold

Increasing the rejection threshold \(\falsifythreshold\) in Eq. 2.12 can be a simple way to add a safety margin to the \(\Bemd{}\) criterion, at the cost of some statistical power, to account for small violations of Eq. 2.31.

Multi-step comparisons

Initially, the large number of candidate models may force the selection of a larger (i.e. more conservative) \(c\). After using this \(c\) to reject some of the candidates, it may be possible to reduce the value of \(c\), thus increasing the discriminatory power and possibly further reducing the remaining pool of candidates.

Improved experimental control

If one can improve the reproducibility across experiments, the epistemic distributions used for calibration can correspondingly be made tighter. This makes it easier to find a \(c\) which works in all experimental conditions, since there are overall fewer conditions to satisfy.

Domain-informed loss function

A loss function designed with knowledge of the domain or target application can ignore irrelevant differences between models. In addition to improving the relevance of comparisons, this tends to make the risk a smoother function of experimental parameters, which should make the \(\Bemd{}\) easier to calibrate.

Post hoc correction of \(\Bemd{}\) values

As long as we select a \(c\) for which the \(\Bconf{AB;Ω}\Bigl(\Bemd{AB;c}\Bigr)\) function of Eq. 4.4 is monotone, we can use the calibration curves themselves to interpret values of \(\Bemd{AB;c}\) by looking at the \(\Bconf{AB;Ω}\) to which they map. This approach could be used to improve the discriminatory power of a conservative \(\Bemd{}\), but also to correct it in regions where it is overconfident — assuming one has sufficient trust that the calibration curves are truly representative of experimental variations.
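The post hoc correction just described amounts to reading a computed \(\Bemd{}\) value off the calibration curve. A minimal sketch, with entirely hypothetical curve values standing in for an actual calibration experiment:

```python
import numpy as np

# Hypothetical samples of a calibration curve: pairs (Bemd, Bconf)
# assumed to have been measured in calibration experiments and to be monotone.
bemd_grid  = np.array([0.50, 0.60, 0.70, 0.80, 0.90, 1.00])
bconf_grid = np.array([0.50, 0.55, 0.68, 0.83, 0.95, 1.00])

def corrected_B(bemd_value):
    """Map a computed Bemd value to the Bconf it corresponds to on the
    calibration curve.  Only meaningful while the curve is monotone,
    as required in the text."""
    assert np.all(np.diff(bconf_grid) >= 0), "calibration curve must be monotone"
    return np.interp(bemd_value, bemd_grid, bconf_grid)
```

For instance, with these (made-up) grids a computed \(\Bemd{} = 0.75\) would be reported as the interpolated \(\Bconf{}\) midway between 0.68 and 0.83, i.e. the calibration curve, not the raw \(\Bemd{}\), carries the final probability statement.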

7.3. Comparing models directly with the loss distribution#

To further illustrate the previous point, we consider an alternative comparison criterion which directly uses the distribution of losses \(Q\) (i.e. the distribution in Fig. 7.1a) instead of the more complicated process \(\qproc\) over PPFs with which we defined the \(\Bemd{}\). To this end, let us define a \(B^Q\) criterion in analogy with Eq. 2.13:

(7.1)#\[\begin{split}\begin{aligned} \BQ{ab;c_Q} &\coloneqq P(Q_a < Q_b + η)\,, \\ η &~\sim \nN(0, c_Q^2) \,. \end{aligned}\end{split}\]

The Gaussian noise \(η\) is added to allow us to adjust the sensitivity of the criterion, analogously to how \(c\) adjusts the sensitivity of the \(\Bemd{ab;c}\) criterion. Both criteria thus have one free parameter, making the comparison relatively fair.
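Eq. 7.1 can be estimated by straightforward Monte Carlo: draw loss pairs from the two empirical loss distributions, jitter with \(η\), and count. The function below is our own illustrative sketch (names and toy losses are hypothetical); it assumes \(Q_a\) and \(Q_b\) are sampled independently.

```python
import numpy as np

def BQ(Qa, Qb, c_Q, n_draws=100_000, seed=None):
    """Monte Carlo estimate of B^Q = P(Q_a < Q_b + eta), eta ~ N(0, c_Q^2).

    Qa, Qb: arrays of pointwise losses for models a and b.
    Loss pairs are drawn independently from the two empirical distributions.
    """
    rng = np.random.default_rng(seed)
    qa = rng.choice(Qa, size=n_draws)
    qb = rng.choice(Qb, size=n_draws)
    eta = rng.normal(0.0, c_Q, size=n_draws)
    return np.mean(qa < qb + eta)

# Toy losses: model a is clearly better than model b.
rng = np.random.default_rng(1)
Qa = rng.normal(0.0, 0.1, size=1000)
Qb = rng.normal(1.0, 0.1, size=1000)
b_small = BQ(Qa, Qb, c_Q=0.1, seed=2)   # near 1: confident comparison
b_large = BQ(Qa, Qb, c_Q=10.0, seed=2)  # pulled towards 0.5 by the noise
```

As with \(c\) in \(\Bemd{ab;c}\), increasing \(c_Q\) pulls \(\BQ{}\) towards 0.5, which is the sensitivity-adjustment role the text assigns to \(η\).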

A possible argument in favour of using \(\BQ{}\) as a criterion is that increasing the amount of misspecification — for example by increasing the unmodelled bias \(\Bspec_0\) in Eq. 2.34 — will affect the distribution of losses in some way. So although the shape of the PPFs is mostly determined by aleatoric uncertainty (as evidenced by the similarity of the distributions in Fig. 7.1a), one can expect them to also contain some information about the amount of misspecification — and thereby also the amount of epistemic uncertainty.

../../_images/BQ-to-Bemd-comparison_prinz_calib-scatter.svg

Fig. 7.3 (a) Calibration experiments for the neuron model, using \(\BQ{}\) (Eq. 7.1) instead of \(\Bemd{}\) (Eq. 2.13) as a comparison criterion. To better show the distribution of experiments, each histogram bin (see Calibration experiments in the Methods) is represented as a point. (b) Same curves as in Fig. 2.5, this time with bins presented as points to ease comparison with (a). Compared to (a), points are more uniformly distributed along both the horizontal and vertical axes. All panels use the same set of sensitivity values (either \(c_Q\) or \(c\)), with colours as indicated in the central legend. [source]#

Nevertheless, as we see in Fig. 7.3, the \(\BQ{}\) criterion is less effective for comparing models than the \(\Bemd{}\) in at least three ways (recall that the goal is not to find a criterion which accurately predicts the model with lowest risk, but one which accurately predicts the uncertainty on that risk, i.e. \(\Bconf{}\)):

Reduced signal

Only a few experiments on the main diagonal have \(\Bconf{}\) values different from 0 or 1. The \(\BQ{}\) values therefore carry little information about the epistemic uncertainty. Learning a monotone transformation from \(\Bemd{}\) or \(\BQ{}\) to \(\Bconf{}\) is only possible if the curve is strictly monotone.

Increased anti-correlation

between \(\BQ{}\) and \(\Bconf{}\): in five out of six comparisons between neuron models, we see strong anticorrelated tails at both ends of the curves, which dip all the way back to \(\Bconf{} \approx 0.5\). This makes it harder to find a \(c_Q\) for which the \(B^Q\) is actually informative, since the curve is not even injective. While there is some anticorrelation also with the \(\Bemd{}\) (Fig. 7.3b), it is much less pronounced and limited to comparisons with the worse models (\(\M_C\) and \(\M_D\)).

The sensitivity parameter does not really help

Increasing \(c_Q\) squeezes curves horizontally, bringing all \(\BQ{}\) values closer to 0.5, but it changes neither their shape nor the length of the anticorrelated tails at both ends. All curves therefore effectively provide the same information. In contrast, the \(c\) parameter has a much stronger effect on the shape of \(R\)-distributions (cf. Fig. 6.2).

These results suggest that an approach based only on distributions of the loss \(Q\) is likely to always struggle with disentangling epistemic from aleatoric uncertainty. It also has less information to work with: whereas \(\Bemd{}\) is parameterised by two functions, \(\qs\) and \(δ^{\EMD}\), \(\BQ{}\) is parameterised by only one, the density \(p(Q)\). Moreover, \(p(Q)\) and \(\qs\) contain the same information (they are the PDF and PPF of the same random variable), so the discrepancy function \(δ^{\EMD}\) provides \(\Bemd{}\) with strictly more information — information which is also specifically designed to disentangle the effects of misspecification from aleatoric uncertainty.