
2.9. Different criteria embody different notions of robustness


As a final step, we now wish to locate our proposed method within the wider constellation of model selection methods.

The motivation behind any model selection criterion is to make the selection in some way robust, i.e. to encourage a choice which is good not just for the particular data and models at hand, but also for variations thereof. Otherwise we would simply select the model with the lowest empirical risk and be done with it. However, criteria vary widely in the kinds of variations they account for. For the purpose of comparing with the EMD rejection rule, we consider three types:

In-distribution variations of the data

where new samples are drawn from the same data-generating process \(\Mtrue\) as the data.

Out-of-distribution variations of the data

where new samples are drawn from a data-generating process \(\Mtrue'\) different from the process \(\Mtrue\) which produced the data. How much \(\Mtrue'\) is allowed to differ from \(\Mtrue\) depends on the target application.

Variations of model parameters,

for example by sampling from a Bayesian posterior, or refitting the model to new data.

Robustness to data variations, whether in- or out-of-distribution, is generally considered a positive. In contrast, robustness to model variations (what might more commonly be described as insensitivity to model parameters) is often attributed to excess model complexity and thus considered a negative.

Exactly which type of variation a criterion accounts for — and just as importantly, how it does so — defines what we might call the paradigm for that criterion. In Fig. 2.8 we compare the EMD approach to five other commonly used criteria with a variety of model selection paradigms; we expand on those differences below.

For our purposes the most important of those differences is how a criterion accounts for epistemic uncertainty. Four of the considered criteria do this by comparing parametrised families of models, either explicitly by averaging their performance over a prior (BIC, \(\log \eE\)), or implicitly by refitting the model to each new dataset (MDL, AIC). In other words, such criteria compare the mathematical structure of different models rather than specific parameter values; they would not be suited to the neural circuit example of earlier sections, where the candidate models were distinguished solely by their parameters. The comparison we carry out here is therefore only possible when the models being compared are structurally distinct.

Our comparison is also tailored to highlight features of particular importance for experimental data and which differentiate our method. It is by no means exhaustive, and many aspects of model comparison are omitted — such as their consistency or how they account for informative priors. The example data were also chosen to be in a regime favourable to all models; this is partly why many panels show similar behaviour.

[Figure 2.8: compare-other-criteria-asymptotics.svg]

Fig. 2.8 Behaviour of model selection criteria as a function of dataset size. Datasets were generated using Eq. 2.34; they differ only in their number of samples and random seed; an example is shown in the top right. The comparison outcome between two models is given by the difference between their criteria; these are plotted as differences w.r.t. an arbitrary reference model (blue curve). The log evidence (\(\log \eE\)) and \(\elpd\) provide uncertainties, shown either as shading or as error bars. MDL curves are shorter because computations exceeding 100 hours were aborted.

We can broadly classify criteria according to whether they select models based solely on their predictive accuracy, or whether they also penalise their complexity (so as to prefer simpler models).

Let us begin with the latter. The most common of these is likely the Bayes factor, formally defined as the ratio of model probabilities and in practice computed as the ratio of marginal likelihoods, or model evidences [32, 46, 47]:

(2.36)\[B^{\text{Bayes}}_{AB} = \frac{p(\M_A \mid \D)} {p(\M_B \mid \D)} = \frac{\int p(\D \mid θ, \M_A) π_A(θ) dθ} {\int p(\D \mid θ, \M_B) π_B(θ) dθ} \eqqcolon \frac{\eE_A} {\eE_B} \,.\]

(We previously defined the evidence \(\eE_A\) in the same way in Eq. 2.4.) Here \(p(\D \mid θ, \M_A)\) and \(π_A\) are the likelihood and prior distributions, respectively. Going forward, instead of Bayes factors we will consider each model’s log evidence: the former are easily recovered from the difference of the latter,

\[\log B^{\text{Bayes}}_{AB} = \log \eE_A - \log \eE_B \,,\]

and this makes it easier to compare multiple models at once. It also matches the conventional format of information criteria. Numerical evaluation of the model evidence is non-trivial, but generic software solutions are available. For this comparison we used an implementation of nested sampling [48] provided by the Python package dynesty [49].
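
A minimal sketch of such a computation is given below; the Gaussian likelihood, uniform priors and dataset are illustrative stand-ins, not the models of Fig. 2.8. Running this once per model yields log evidences whose differences recover the log Bayes factors.

```python
# Hedged sketch: log evidence via nested sampling with dynesty.
# Likelihood, priors and data are illustrative assumptions.
import numpy as np
import dynesty

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=0.5, size=50)          # stand-in dataset

def loglike(theta):
    mu, sigma = theta
    # Gaussian log likelihood of the data given (mu, sigma)
    return np.sum(-0.5*((data - mu)/sigma)**2 - np.log(sigma) - 0.5*np.log(2*np.pi))

def prior_transform(u):
    # Map the unit hypercube to the prior: mu ~ Unif(-5, 5), sigma ~ Unif(0.01, 5)
    return np.array([-5.0 + 10.0*u[0], 0.01 + 4.99*u[1]])

sampler = dynesty.NestedSampler(loglike, prior_transform, ndim=2)
sampler.run_nested()
res = sampler.results
print(res.logz[-1], res.logzerr[-1])    # log evidence and its estimated error
```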

The model evidence corresponds to studying variations of the model (represented by the prior) for a fixed dataset \(\D\). A model is penalised if its prior is unnecessarily broad or high-dimensional; this is the so-called “Occam’s razor” principle which favours simpler models [46, 47]. This sometimes leads to undesirable results when priors are non-informative [22, §6].

The Bayesian information criterion (BIC) is an asymptotic approximation of the log evidence, either around the maximum likelihood or maximum a posteriori estimate [32, 36]. It will approach the log evidence (\(\log \eE\)) when the likelihood is non-singular and approximately Gaussian around its maximum.
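
For reference, the statistic itself takes the standard form

\[\operatorname{BIC}_A \coloneqq -2 \log p\bigl(\D \mid \hat{θ}, \M_A\bigr) + k_A \log L \,, \qquad \log \eE_A \approx -\tfrac{1}{2}\operatorname{BIC}_A \,,\]

where \(\hat{θ}\) is the maximum likelihood (or maximum a posteriori) estimate, \(k_A\) the number of parameters and \(L\) the number of samples.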

Another approach which penalizes complexity is the minimum description length (MDL). This is more accurately described as a methodological framework built around the choice of a “universal probability distribution” [50]; usually however the normalised maximum likelihood (NML) distribution is implied, and this is what we do here. Assuming equal prior preference for each model, MDL selects the one with the highest NML probability; equivalently, it computes for each model the quantity

(2.37)\[\MDL_A \coloneqq -\log \max_θ p(\D\mid\M_A(θ)) + \log \comp(\M_A) \,,\]

and selects the model for which \(\MDL_A\) is minimal. (It is assumed here that \(\M_A\) is parametrised, and \(\M_A(θ)\) denotes the model \(\M_A\) with its parameters set to the values \(θ\).) The first term is the negative log of the likelihood maximised on the training data. The second is the log of the MDL complexity, defined by integrating over the entire event space:

(2.38)\[\comp(\M_A) \coloneqq \int \max_θ p(\D'\mid\M_A(θ)) \,d\D' \,.\]

In other words, the MDL complexity quantifies the ability of a model to fit arbitrary data. Note that it does not matter here which data-generating process \(\Mtrue\) produced the data, since every possible dataset is given equal weight. Approximations are almost always required to compute Eq. 2.38, since the integral is usually intractable; for this comparison we chose a data regime where it is just about possible to enumerate the possible datasets and estimate \(\comp(\M_A)\) directly with Monte Carlo.
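
To give a flavour of what such a direct evaluation involves, the sketch below computes Eq. 2.38 exactly for a toy Bernoulli model, where the event space (all binary datasets of length \(L\)) is small enough to sum over; the model is an illustrative assumption, unrelated to those of Fig. 2.8.

```python
# Hedged sketch: direct evaluation of the MDL complexity (Eq. 2.38) for a toy
# Bernoulli model, summing over the full event space of binary datasets.
import math

def bernoulli_comp(L: int) -> float:
    """Sum of max_theta p(D' | theta) over all 2^L binary datasets.
    Datasets with the same number k of successes share the same maximised
    likelihood (k/L)^k (1-k/L)^(L-k), so we sum over k with binomial weights."""
    total = 0.0
    for k in range(L + 1):
        p_hat = k / L
        max_lik = (p_hat ** k) * ((1.0 - p_hat) ** (L - k))  # 0**0 == 1 covers k=0, L
        total += math.comb(L, k) * max_lik
    return total

for L in (10, 100, 1000):
    print(L, math.log(bernoulli_comp(L)))   # log COMP grows roughly as (1/2) log L
```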

With MDL, therefore, we consider variations of the models, since each model is refitted to every dataset \(\D'\). There are no in-distribution dataset variations (the true process \(\Mtrue\) is ignored), but there are out-of-distribution variations, in the form of the integral over the event space.

Moving on now to criteria which compare predictive accuracy, we start with the expected log pointwise predictive density (elpd) [33]. This is defined as the expectation of the loss \(Q\) over \(L\) samples drawn from the data-generating process:

(2.39)\[\begin{align} \label{eq_def_elpd} \elpd_A &:= \sum_{i=1}^L \EE_{(x_i, y_i) \sim \Mtrue}\bigl[Q(x_i, y_i; \M_A)\bigr] \\ &~= L\,R_A \,. \notag \end{align}\]

The most common form of the \(\elpd\) uses the (negative) log likelihood as the loss function \(Q\). In this form, where each sample is drawn from the same distribution, the \(\elpd\) is proportional to the risk \(R_A\) defined in Eq. 2.2, and therefore conceptually analogous to it: it describes robustness to in-distribution variations of the dataset. The models are kept fixed in these comparisons; they are not refitted to the new data. The uncertainty on the \(\elpd\) comes from the finite number of samples used to compute the expectation.
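
When one can sample directly from the data-generating process, Eq. 2.39 can be estimated by plain Monte Carlo; the normal process and candidate model below are stand-ins chosen purely for illustration.

```python
# Hedged sketch: Monte Carlo estimate of the elpd (Eq. 2.39) for a fixed model,
# with Q the negative log likelihood; process and model are stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
L = 1000
x = rng.normal(0.0, 1.0, size=L)              # samples from M_true (stand-in)
Q = -stats.norm(0.1, 1.0).logpdf(x)           # per-sample loss under model A
elpd_A = Q.sum()                              # estimate of L * R_A
se = Q.std(ddof=1) * np.sqrt(L)               # finite-sample uncertainty
print(elpd_A, se)
```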

Since in current practice the \(\elpd\) is mostly used with Bayesian models, this is also what we do in Fig. 2.8. Efficient estimators exist for computing the elpd from training data, using either leave-one-out cross-validation [33] or the widely applicable information criterion (WAIC) [51]. This avoids the need to reserve a separate test dataset. We used ArviZ’s [52] implementation of WAIC for this comparison.
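
In code, the estimator is essentially a one-liner once the pointwise log likelihoods are stored in an InferenceData object; the example dataset below ships with ArviZ and is used purely for illustration.

```python
# Hedged sketch: elpd estimated from training data with ArviZ's WAIC.
import arviz as az

idata = az.load_arviz_data("centered_eight")   # demo posterior with log_likelihood
res = az.waic(idata)                           # ELPDData: point estimate and SE
print(res.elpd_waic, res.se)

# Several models can be ranked at once (hypothetical idata_A, idata_B):
# az.compare({"A": idata_A, "B": idata_B}, ic="waic")
```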

In contrast to the \(\elpd\), the Akaike information criterion (AIC) does not assume a fixed model. Although it also ranks models according to their expected loss, this time the expectation is (conceptually) over both the test data \(\D_\test\) and the training data \(\D_\train\), each drawn independently from the data-generating process [34, 35]:

\[\elpd^{\AIC}_A = \EE_{\D_\test \sim \Mtrue} \EE_{\D_\train \sim \Mtrue} \Bigl[ Q\bigl(\D_\test; \M_A\bigl(\hat{θ}_{\D_\train}\bigr)\bigr) \Bigr] \,,\]

where \(\hat{θ}_{\D_\train}\) is the maximum likelihood estimate given the training data. Concretely, however, the \(\AIC\) is a simple statistic, similar to the BIC and other information criteria:

\[\AIC_A \coloneqq 2\,Q\bigl(\D_\train; \M_A(\hat{θ}_{\D_\train})\bigr) + 2k_A \,,\]

where \(k_A\) is the number of parameters of model \(\M_A\). The derivation of the \(\AIC\) relies on \(Q\) being the negative log likelihood, and assumes that the likelihood is approximately Gaussian and non-singular around \(\hat{θ}_{\D_\train}\). When these conditions are met, \(Δ\AIC_{AB} := \AIC_A - \AIC_B\) is an unbiased estimator of \(2\bigl(\elpd^{\AIC}_A - \elpd^{\AIC}_B\bigr)\). In this case, the model with the lowest \(\AIC\) will, in expectation, also be the one with the lowest expected risk.
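
The following toy computation illustrates the statistic for two nested Gaussian models (mean fixed at zero versus fitted); data and models are illustrative stand-ins.

```python
# Hedged sketch: delta-AIC between two nested Gaussian models on stand-in data.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.3, 1.0, size=200)             # stand-in training data

def gauss_nll(x, mu, sigma):
    # Negative log likelihood of iid normal samples (this is Q)
    return 0.5*np.sum(((x - mu)/sigma)**2) + x.size*np.log(sigma*np.sqrt(2*np.pi))

sigA = np.sqrt(np.mean(x**2))                  # MLE of sigma with mu fixed at 0
aic_A = 2*gauss_nll(x, 0.0, sigA) + 2*1        # k_A = 1

muB, sigB = x.mean(), x.std()                  # joint MLE (np.std uses ddof=0)
aic_B = 2*gauss_nll(x, muB, sigB) + 2*2        # k_B = 2

print(aic_A - aic_B)                           # positive differences favour model B
```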

So how does the \(\Bemd{}\) compare to the criteria listed above? First, it computes \(R\)-distributions for fixed models. This is a key feature, since it is otherwise impossible to compare different parameter vectors for structurally identical models. (Note that the EMD approach also allows models to have different structure; it simply does not require it.) This also means that the \(\Bemd{}\) does not penalise model complexity, in contrast to the evidence, MDL and AIC.

Second, like the \(\elpd\), the \(\Bemd{}\) accounts for in-distribution variability by estimating the risk using held-out test data \(\D_\test\).

Third, EMD \(R\)-distributions also account for out-of-distribution variability. Conceptually this is done via the epistemic distributions \(Ω\); concretely, via the hierarchical beta (HB) process on PPFs, whose variance stems ultimately from the discrepancy function \(δ^{\EMD}\). Better models therefore result in less variability, reflecting our better grasp of the process being modelled. Of the other criteria, only the MDL allows for some form of out-of-distribution variability (via its complexity penalty), and there it is essentially rigid: always an integral over the entire event space.

The biggest difference between the EMD \(R\)-distributions and other criteria however is their behaviour as the number of samples \(L\) increases. This we illustrate in the lower part of Fig. 2.8, which plots the value of a criterion for different models as a function of \(L\). We do this for three collections of models, where the data-generating process is always the Planck model \(\Bspec_\mathrm{P}\) described in the previous section. In all panels, scores are defined so that models with lower scores are preferred.

In the first row we compare four candidate models of the form

\[\Bspec \mid λ, T, σ, \{b_j\} \sim \nN \,\biggl(\Bspec_{\mathrm{P}}(λ; T) + \sum_{j=0}^m b_j λ^j,\; σ^2 \Bspec_{\mathrm{P}}(λ; T) \biggr) \,.\]

The candidates are therefore nested: any model within the class \(\M_{m=2}\) can be recovered from the class \(\M_{m=4}\) by fixing \(b_3=b_4=0\). Moreover, all models effectively contain \(\Mtrue\), since it corresponds to \(m=0\). (The data are in a regime where the Poisson distribution of \(\Mtrue\) is close to Gaussian.)
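
A hedged sketch of sampling from this candidate family is given below; the radiation constants, parameter values and wavelength grid are stand-ins rather than the exact setup of Eq. 2.34.

```python
# Hedged sketch: draws from the nested candidate family above (Planck mean plus
# a degree-m polynomial, noise variance proportional to the Planck term).
import numpy as np

def planck(lam, T, c1=3.74e-16, c2=1.44e-2):
    # Planck spectral radiance up to conventions; c1, c2 are stand-in constants
    return c1 / lam**5 / np.expm1(c2 / (lam * T))

def sample_candidate(lam, T, sigma, b, rng):
    """b lists the polynomial coefficients (b_0, ..., b_m)."""
    mean = planck(lam, T) + np.polynomial.polynomial.polyval(lam, b)
    return rng.normal(mean, sigma * np.sqrt(planck(lam, T)))

lam = np.linspace(1e-7, 3e-6, 30)              # stand-in wavelength grid (metres)
y = sample_candidate(lam, T=4000.0, sigma=1e3, b=[0.0], rng=np.random.default_rng(3))
```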

The second row is similar, but expands around the Rayleigh-Jeans model instead:

\[\Bspec \mid λ, T, σ, \{b_j\} \sim \nN \,\biggl(\Bspec_{\mathrm{RJ}}(λ; T) + \sum_{j=0}^m b_j λ^j,\; σ^2 \Bspec_{\mathrm{RJ}}(λ; T) \biggr) \,.\]

In this case all models are misspecified, but those with higher degree polynomials are better able to compensate for this and thus achieve lower prediction errors.

The third row compares the two \(m=0\) models from the first two rows.

If we interpret the score difference between two models (measured as the horizontal distance between two curves) as the strength of the evidence supporting a particular model choice, then it is clear from Fig. 2.8 that for all criteria except the EMD, this strength grows without bound as we increase the number of samples \(L\). A bounded strength, however, is desirable when analysing experimental data, since no experiment has infinite accuracy.

In contrast, we see that the \(R\)-distributions in the grey shaded area stabilise as a function of \(L\). The tail probabilities \(\Bemd{ab;c} = P(R_A < R_B \,:\, c)\) (Eq. 2.13), which define the EMD rejection rule, therefore also converge in general to finite, non-zero values. In some cases the \(R\)-distributions may approach a Dirac delta as \(L\) increases, as we see in the first row of Fig. 2.8; this occurs when the candidate model is able to match \(\Mtrue\) exactly.
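
For illustration, such a tail probability can be estimated directly from samples of the two \(R\)-distributions; the normal draws below are stand-ins for actual EMD samples.

```python
# Hedged sketch: estimating B^EMD = P(R_A < R_B) (Eq. 2.13) from samples of the
# two R-distributions (stand-in normal draws shown here).
import numpy as np

rng = np.random.default_rng(4)
R_A = rng.normal(1.00, 0.05, size=4000)        # draws from model A's R-distribution
R_B = rng.normal(1.10, 0.20, size=4000)        # draws from model B's R-distribution
B_emd = np.mean(R_A[:, None] < R_B[None, :])   # fraction of pairs with R_A < R_B
print(B_emd)
```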

We provide a different set of comparisons in the Supplementary Results, listing numerical values of criteria differences in Table 6.1 to complement the graphical presentation of Fig. 2.8. These alternative comparisons place greater emphasis on how misspecification can affect a criterion’s value, and how that value can (mis)represent the strength of the evidence.