2. Results#


2.1. The risk as a selection criterion for already fitted models#

As set out in the Introduction, we start from the assumption that models of the natural phenomenon of interest have already been fitted to data; this prior step is indicated by the grey faded rectangle in Fig. 1.1. The manner in which this is done is not important; the model parameters could, for example, result from maximizing a likelihood, running a genetic algorithm or performing Bayesian inference.

Our starting points are the true data-generating process \(\Mtrue\) and the pointwise loss \(Q\). The loss depends on the model \(\M_A\), is evaluated on individual data samples \((x_i, y_i)\), and returns a real number:

(2.1)#\[Q(x_i, y_i; \M_A) \in \RR \,.\]

Often, but not always, the negative log likelihood is chosen as the loss. What we seek to estimate is the risk, i.e. the expectation of \(Q\):

(2.2)#\[R_A = \EE_{(x_i, y_i)\sim\Mtrue}[Q(x_i,y_i; \M_A)];\]

we use the notation \((x_i, y_i)\sim\Mtrue\) to indicate that the samples \((x_i, y_i)\) are drawn from \(\Mtrue\). For simplicity, our notation assumes that model parameters are point estimates. For Bayesian models, one can adapt the definition of the loss \(Q\) to include an expectation over the posterior; Eq. 2.2 would then become the Bayes risk. Our methodology is agnostic to the precise definition of \(Q\), as long as it is evaluated pointwise; it can differ from the objective used to fit the models.

The risk describes the expected performance of a model on replicates of the dataset, when each replicate is drawn from the same data-generating process \(\Mtrue\). It is therefore a natural basis for comparing models based on their ability to generalise. It is also the gold standard objective for a machine learning algorithm [25, 26], for the same reason.

A common consideration with model selection criteria is whether they are consistent, i.e. whether they eventually select the true model as the number of samples grows. When models are misspecified, this is usually expressed as whether a criterion selects the model with the lowest risk [27, 28]; a criterion which selects the model with the lowest risk is therefore consistent by definition. In practice, of course, we usually estimate \(R\) from finite samples with the empirical risk:

(2.3)#\[\hat{R}_A = \frac{1}{\lvert \D \rvert} \sum_{(x_i, y_i) \in \D} Q(x_i, y_i; \M_A) \,.\]

The principle of empirical risk minimisation then ensures that \(\hat{R}_A\) converges to \(R_A\) as \(\lvert \D \rvert\) approaches infinity [25]. As a consequence, a criterion which selects the model with the lowest empirical risk is also consistent. (We can ensure that it is also consistent in the original sense of asymptotically selecting the true model by choosing a loss function \(Q\) which is proper [29]. The log likelihood is known to be proper [30, Chap. 7.1].)
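To make Eq. 2.3 concrete, the empirical risk is nothing more than a sample average of pointwise losses. The following sketch (in Python, with a hypothetical pointwise loss `Q` and a dataset of `(x, y)` tuples) illustrates the computation; it is an illustration, not the implementation used in this work.

```python
import numpy as np

def empirical_risk(Q, data, model):
    """Empirical risk (Eq. 2.3): average of the pointwise loss Q
    over all (x, y) samples in the dataset."""
    return np.mean([Q(x, y, model) for (x, y) in data])

# Hypothetical usage: compare two already fitted models on a test set.
# R_hat_A = empirical_risk(Q, test_data, model_A)
# R_hat_B = empirical_risk(Q, test_data, model_B)
```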

Convergence to \(R_A\) also means that the result of a risk-based criterion is insensitive to the size of the dataset, provided we have enough data to estimate the risk of each model accurately. This is especially useful in the scientific context, where we want to select models which can be used on datasets of any size.

In addition to defining a consistent, size-insensitive criterion, risk also has the important practical advantage that it remains informative when the models are structurally identical. We contrast this with the marginal likelihood (also known as the model evidence), which for some dataset \(\D\), model \(\M_A\) parametrised by \(θ\), and prior \(π_A\), would be written

\[\eE_A = \int p\bigl(\,\D \bigm| \M_A(θ)\,\bigr) \,π_A(θ)dθ \,.\]

Since it is integrated over the entire parameter space, \(\eE_A\) characterizes the family of models \(\M_A(\cdot)\), but says nothing of a specific parametrisation \(\M_A(θ)\). We include a comparison of our proposed EMD rejection rule to other common selection criteria, including the model evidence, at the end of our Results.

A criterion based on the risk will not, however, account for a model which overfits the data, which is why it is important to use separate datasets for fitting and comparing models. This in any case better reflects the usual scientific paradigm of recording data, forming a hypothesis, and then recording new data (either in the same laboratory or a different one) to test that hypothesis. Splitting data into training and test sets is also standard practice in machine learning, to avoid overfitting and ensure better generalisation. In the following, therefore, we always denote the observation data as \(\D_\test\) to stress that they should be independent from the data to which the models were fitted.

In short, the EMD rejection rule we develop in the following sections ranks models based on their empirical risk (Eq. 2.3), augmented with an uncertainty represented as an \(R\)-distribution. Comparisons based on the risk are consistent and insensitive to the number of test samples, but require a separate test dataset to avoid overfitting.

2.2. Example application: Selecting among disparate parameter sets of a biophysical model#

To explain the construction of the EMD rejection rule, we use the dynamical model of a neuron membrane potential described in Prinz et al. [6]. This choice was motivated by the fact that fitting a neuron model is a highly ill-posed problem, and therefore corresponds to the situation we set out in the Introduction: disparate sets of parameters for a structurally fixed model which nevertheless produce similar outputs. We focus on the LP neuron type in particular; Prinz et al. find five distinct parameter sets which reproduce its experimental characteristics. Throughout this work, we reserve LP 1 as the model which generates (through simulation) the true data \(\D_\test\), and use LP 2 to LP 5 to define the candidate models \(\M_A\) to \(\M_D\) which we compare against \(\D_\test\). We intentionally exclude LP 1 from the candidate parameters to emulate the typical situation where none of the candidate models fit the data perfectly. Visual inspection of model outputs suggests that two candidates (models \(\M_A\) and \(\M_B\)) are more similar to the true model output (Fig. 2.1). We will show that our method not only accords with those observations, but makes the similarity between models quantitative.

It is important to note that we treat the models \(\M_A\) to \(\M_D\) as simply given. We make no assumption as to how they were obtained, or whether they correspond to the maximum of a likelihood. We chose this neural example, where parameters are obtained with a multi-stage, semi-automated grid search, partly to illustrate that our method is agnostic to the fitting procedure.

The possibility of comparing models visually was another factor in choosing this example. Since numbers can be misleading, this allows us to confirm that the method works as expected. Especially for our target application, where all candidate models share the same equation structure, a visual validation is key to establishing the soundness of the \(\Bemd{}\), since none of the established model selection criteria like the Bayes factor, AIC or WAIC [30, 31] are applicable. Indeed, these other methods only compare alternative equation structures, not alternative parameter sets. We include a more in-depth comparison to these other methods at the end of the Results.

../_images/prinz_example-traces.svg

Fig. 2.1 Comparison of LP neuron responses. The goal is to estimate a good set of candidate models for neuron LP. a) We consider a simple circuit with a known pacemaker neuron (AB) and a post-synaptic neuron of unknown response (LP). b) Output of the AB neuron. This serves as input to the LP neuron. In practice, these data would more likely be provided by an experimental recording of neuron AB, but here we assume it is a simulatable known model for convenience. c) Response of neuron model LP 1 to the input in (b). Together, (b) and (c) serve as our observed data \(\D_{\test}\). d) Response of neuron models LP 2 to LP 5 to the input in (b). These are our four candidate models of neuron LP. [source]#

The datasets in this example take the form of one-dimensional time series, with the time \(t \in \Tspace\) as the independent variable and the membrane potential \(V^{\LP{\!\!}} \in \Vspace\) as the dependent variable. We denote by \(\Tspace \!\times\! \Vspace\) the space of all possible time-potential tuples. Model specifics are given in the Methods; from the point of view of model selection, what matters is that we have ways of generating series of these time-potential tuples: either using the true data-generating process (\(\Mtrue\)) or one of the candidate models (\(\M_A\) to \(\M_D\)).

We will assume that the dataset used to evaluate models is composed of \(L\) samples, with each sample a \((t, \tilde{V}^{\LP{\!\!}})\) tuple:

\[\D_{\test} = \Bigl\{ \bigl(t_k, \tilde{V}^{\LP{\!\!}}(t_k; \Mtrue, I_{\mathrm{ext}})\bigr) \in \Tspace \!\times\! \Vspace \Bigr\}_{k=1}^L \,.\]

The original model by Prinz et al. [6] produced deterministic traces \(V^{\LP{\!\!}}\). Experimental measurements however are variable, and our approach depends on that variability. For this work we therefore augment the model with two sources of stochasticity. First, the system as a whole receives a coloured noise input \(I_{\mathrm{ext}}\), representing an external current received from other neurons. (External currents may also be produced by the experimenter, to help model the underlying dynamics.) Second, we think of the system as the combination of a biophysical model – described by Eq. 4.1, Eq. 4.2 and Eq. 4.3 – and an observation model which adds Gaussian noise. The observation model represents components which don’t affect the biophysics – like noise in the recording equipment – and can be modeled to a first approximation as:

(2.4)#\[\tilde{V}^{\LP{\!\!}}(t_k; \Mtrue) = \underbrace{V^{\LP{\!\!}}\bigl(t_k, I_{\mathrm{ext}}(t_k; τ, σ_i); \Mtrue\bigr)}_{\text{biophysical model}} + \underbrace{ξ(t_k; σ_o)}_{\mathclap{\text{observation model}}} \,.\]

The parameters of the external input \(I_{\mathrm{ext}}\) are \(τ\) and \(σ_i\), which respectively determine its autocorrelation time and strength (i.e. its amplitude). The observation model has only one parameter, \(σ_o\), which determines the strength of the noise. The symbol \(\Mtrue\) is shorthand for everything defining the data-generating process, which includes the parameters \(τ\), \(σ_i\) and \(σ_o\). The indices \(i\) and \(o\) are used to distinguish input and output model parameters.

Since the candidate models in this example assume additive observational Gaussian noise, a natural choice for the loss function is the negative log likelihood:

(2.5)#\[\begin{split}\begin{aligned} Q(t_k, \tilde{V}^{\LP{\!\!}}; \M_a) &= -\log p\bigl(\tilde{V}^{\LP{\!\!}} \mid t_k, \M_a\bigr) \,\\ &= \tfrac{1}{2} \log (2π σ_o^2) + \frac{ξ(t_k; σ_o)^2}{2 σ_o^2} \,,\\ &= \tfrac{1}{2} \log (2π σ_o^2) + \frac{\bigl(\tilde{V}^{\LP{\!\!}} - V^{\LP{\!\!}}(t_k; \M_a)\bigr)^2}{2 σ_o^2} \,. \end{aligned}\end{split}\]

However one should keep in mind that a) it is not necessary to define the loss in terms of the log likelihood, and b) the loss used to compare models need not be the same used to fit the models. For example, if the log likelihood of the assumed observation model is non-convex or non-differentiable, one might use a simpler objective for optimization. Alternatively, one might fit using the log likelihood, but compare models based on global metrics like the interspike interval. Notably, Prinz et al. [6] do not directly fit to potential traces \(\tilde{V}\), but rather use a set of data-specific heuristics and global statistics to select candidate models. In this work we stick with the generic form of Eq. 2.5, which makes no assumptions on the type of data used to fit the models and is thus easier to generalise to different scenarios.
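As an illustration, the loss of Eq. 2.5 takes only a few lines to implement. The sketch below assumes a candidate model object exposing a deterministic trace `model.V(t)` and a known observation noise scale `sigma_o`; both names are hypothetical.

```python
import numpy as np

def Q_gaussian_nll(t_k, V_obs, model, sigma_o):
    """Pointwise loss of Eq. 2.5: negative log likelihood of one
    (t_k, V) sample under additive Gaussian observation noise."""
    residual = V_obs - model.V(t_k)  # observed minus predicted potential
    return 0.5 * np.log(2 * np.pi * sigma_o**2) + residual**2 / (2 * sigma_o**2)
```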

The definition of the risk then follows immediately:

(2.6)#\[R_a \coloneqq \EE_{(t,\tilde{V}^{\LP{\!\!}}) \sim \Mtrue} \bigl[ Q(t,\tilde{V}^{\LP{\!\!}} ; \M_a)\bigr] \,.\]

Here \(a\) stands for one of the model labels \(A\), \(B\), \(C\) or \(D\). The notation \((t,\tilde{V}^{\LP{\!\!}}) \sim \Mtrue\) denotes that the distribution from which the samples \((t,\tilde{V}^{\LP{\!\!}})\) are drawn is \(\Mtrue\). We can think of \(\Mtrue\) as producing an infinite sequence of data points, either by integrating the model dynamics and applying Eq. 2.4 multiple times, or by going to the laboratory and performing the experiment multiple times.

As alluded to above, the candidate model traces in Fig. 2.1d suggest two groups of models: the \(\M_A\) and \(\M_B\) models seem to better reproduce the data than \(\M_C\) or \(\M_D\). Within each group however, it is hard to say whether one model is better than the other; in terms of the risks, this means that we expect \(R_A, R_B < R_C, R_D\). This also means that we should be wary of a selection criterion which unequivocally ranks \(\M_A\) better than \(\M_B\), or \(\M_C\) better than \(\M_D\).

In other words, we expect that uncertainties on \(R_A\) and \(R_B\) should be at least commensurate with the difference \(\lvert R_A - R_B \rvert\), and similarly for the uncertainties on \(R_C\) and \(R_D\).

2.3. A conceptual model for replicate variations#

Evaluating the risk Eq. 2.6 for each candidate model on a dataset \(\D_\test\) yields four scalars \(\hat{R}_A\) to \(\hat{R}_D\); since a lower risk should indicate a better model, a simple naive model selection rule would be

(2.7)#\[\begin{split}\begin{cases} \text{reject }\M_b & \text{if } \hat{R}_a < \hat{R}_b \,,\\ \text{reject }\M_a & \text{if } \hat{R}_a > \hat{R}_b \,,\\ \text{no rejection} & \text{if } \hat{R}_a = \hat{R}_b \,, \end{cases} \quad \text{for } a,b \in \{A,B,C,D\} \,.\end{split}\]

As with many selection criteria (e.g. AIC [32, 33], BIC [34], DIC [35, 36]), this rule is effectively binary: the third option, to reject neither model, has probability zero. It therefore always selects either \(\M_a\) or \(\M_b\), even when the evidence favouring one of the two is extremely weak. Another way to see this is illustrated at the top of Fig. 2.2: the lines representing the four values \(R_A\) through \(R_D\) have no error bars, so even minute differences suffice to rank the models.

../_images/prinz_Rdists.svg

Fig. 2.2 Empirical risk vs. \(R\)-distributions for the four candidate LP models. (top) The empirical risk (Eq. 2.3) for each of the four candidate LP models. (bottom) Our proposed \(\Bemd{\!}\) criterion replaces the risk by an \(R\)-distribution, where the spread of each distribution is due to the replication uncertainty for that particular model. \(R\)-distributions are distributions of the \(R\) functional in Eq. 2.16; we estimate them by sampling the quantile function (i.e. inverse cumulative distribution function) \(q\) according to a stochastic process \(\qproc\) on quantile functions. We used an EMD sensitivity constant of \(c = 2^{-2}\) (see the later section on calibration) for \(\qproc\) and drew samples \(\qh \sim \qproc\) until the relative standard error on the risk was below 3 %. A kernel density estimate (KDE) is used to display those samples as distributions. The \(R\)-distribution for the true model is much narrower (approximately Dirac) and lies far to the left, outside the plotting bounds. [source]#

The problem is that from a scientific standpoint, if the evidence is too weak, this ranking is likely irrelevant – or worse, misinformative. Hence our desire to assign uncertainty to each estimate of the risk. Ideally we would like to be able to compute tail probabilities like \(P(R_A < R_B)\), which would quantify the strength of the evidence in favour of either \(\M_A\) or \(\M_B\). Selecting a minimum evidence threshold \(\falsifythreshold\) would then allow us to convert Eq. 2.7 into a true ternary decision rule with non-zero probability of keeping both models:

(2.8)#\[\begin{split}\begin{cases} \text{reject } \M_b &\text{if } P(R_a < R_b) > \falsifythreshold \,, \\ \text{reject } \M_a &\text{if } P(R_a < R_b) < (1-\falsifythreshold) \,, \\ \text{no rejection} &\text{if } (1-\falsifythreshold) \leq P(R_a < R_b) \leq \falsifythreshold \,. \end{cases}\end{split}\]

As we explain in the Introduction, our goal is to select a model which can describe not just the observed data \(\D_\test\), but also new data generated in a replication experiment. In order for our criterion to be robust, the tail probabilities \(P(R_a < R_b)\) should account for uncertainty in the replication process.

Note that for a given candidate model \(\M_a\) and fixed \(\Mtrue\), we can always estimate the risk with high accuracy if we have enough samples, irrespective of the amount of noise intrinsic to \(\M_a\) or of the misspecification between \(\M_a\) and \(\Mtrue\). Crucially however, we do not assume the replication process to be stationary, so a replicate dataset \(\D_\test'\) may be drawn from a slightly different data-generating process \(\Mtrue'\). This can occur even when \(\Mtrue\) itself is stationary, reflecting the fact that it is often easier to control variability within a single experiment than across multiple ones. Ontologically therefore, the uncertainty on \(\hat{R}\) is a form of epistemic uncertainty, arising from the variability across (experimental) replications.

To make this idea concrete, consider that the input \(I_{\mathrm{ext}}\) and noise \(ξ\) might not be stationary over the course of an experiment with multiple trials. We can represent this by making their parameters random variables, for example

(2.9)#\[\begin{split}Ω \coloneqq \left\{\begin{aligned} \log σ_o &\sim \nN(0.0 \mathrm{mV}, (0.5 \mathrm{mV})^2) \\ \log σ_i &\sim \nN(-15.0 \mathrm{mV}, (0.5 \mathrm{mV})^2) \\ \log_{10} τ &\sim \Unif([0.1 \mathrm{ms}, 0.2 \mathrm{ms}]) \end{aligned}\right. \,,\end{split}\]

and drawing new values of \((σ_o, σ_i, τ)\) for each trial (i.e. each replicate). Since it is a distribution over data-generating processes, we call \(Ω\) an epistemic distribution. For illustration purposes, here we have parametrised \(Ω\) in terms of two parameters of the biophysical model and one parameter of the observation model, thus capturing epistemic uncertainty within a single experiment. In general the parametrisation of \(Ω\) is a modelling choice, and may represent other forms of non-stationarity – for example due to variations between experimental setups in different laboratories.

Conceptually, we could estimate the tail probabilities \(P(R_a < R_b)\) by sampling \(J\) different data-generating processes \(\Mtrue^j\) from \(Ω\), for each then drawing a dataset \(\D_\test^j \sim \Mtrue^j\), and finally computing the risks \(\hat{R}_a^j\) and \(\hat{R}_b^j\) on \(\D_\test^j\). The fraction of datasets for which \(\hat{R}_a^j < \hat{R}_b^j\) would then estimate the tail probability:

(2.10)#\[P(R_a < R_b) \approx \frac{1}{J} \,\Bigl\lvert\bigl\{ j \mid \hat{R}_a^j < \hat{R}_b^j \bigr\}_{j=1}^J \Bigr\rvert \,.\]
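Schematically, the estimate of Eq. 2.10 could be implemented as below. The helpers `sample_Mtrue` (drawing a data-generating process from \(Ω\)), `generate_dataset` and `empirical_risk` are hypothetical placeholders for the corresponding operations.

```python
def tail_prob(sample_Mtrue, generate_dataset, empirical_risk,
              model_a, model_b, J=1000, L=4000):
    """Monte Carlo estimate of P(R_a < R_b) under an epistemic
    distribution (Eq. 2.10): J data-generating processes, one dataset each."""
    wins = 0
    for _ in range(J):
        M_true_j = sample_Mtrue()                 # M_true^j ~ Omega
        D_test_j = generate_dataset(M_true_j, L)  # D_test^j ~ M_true^j
        if empirical_risk(model_a, D_test_j) < empirical_risk(model_b, D_test_j):
            wins += 1
    return wins / J
```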

The issue of course is that we cannot know \(Ω\). First because we only observe data from a single \(\Mtrue\), but also because there may be different contexts to which we want to generalise: one researcher may be interested in modelling an individual LP neuron, while another might seek a more general model which can describe all neurons of this type. These two situations would require different epistemic distributions, with the latter one being in some sense broader.

However we need not commit to a single epistemic distribution: if we have two distributions, and we want to ensure that conclusions hold under both, we can instead use the condition

(2.11)#\[\min_{Ω\in\{Ω_1,Ω_2\}} P_{Ω}(R_a < R_b) > \falsifythreshold\]

to attempt to reject \(\M_b\). In general, considering more epistemic distributions will make the model selection more robust, at the cost of discriminatory power. Epistemic distributions are therefore not prior distributions, since a Bayesian calculation is always tied to a specific choice of prior. (The converse however holds: a prior can be viewed as a particular choice of epistemic distribution.)

We view epistemic distributions mostly as conceptual tools. For calculations, we will instead propose in the next sections a different type of distribution (\(\qproc\); technically a stochastic process), defined not on the data-generating process but on the distribution of pointwise losses. Being lower-dimensional and more stereotyped than \(Ω\), \(\qproc\) can be constructed entirely non-parametrically, up to a scaling constant \(c\) (c.f. Fig. 1.1 and the section listing desiderata for \(\qproc\)). In a later section we will then show, through numerical calibration and validation experiments, that \(\qproc\) also has nice universal properties, so that the only thing that really matters is the overall scale of the epistemic distributions. The constant \(c\) is matched to this scale by numerically simulating epistemic distributions as part of the calibration experiments.

2.4. Model discrepancy as a baseline for non-stationary replications#

To keep the notation in the following sections more general, we use the generic \(x\) and \(y\) as independent and dependent variables. To recover expressions for our neuron example, substitute \(x \to t\), \(y \to \tilde{V}^{\LP{\!\!}}\), \(\X \to \Tspace\) and \(\Y \to \Vspace\). Where possible we also use \(A\) and \(B\) as a generic placeholder for a model label.

Our goal is to define a selection criterion which is robust against variations between experimental replications, but which can be computed using knowledge only of the candidate models and the observed empirical data. To do this, we make the following assumption:

Candidate models represent that part of the experiment which we understand and can replicate.

Under this assumption, more accurate predictions indicate better control of experimental conditions, and thereby less variation across experiments; in other words, we equate epistemic and replication uncertainty. In the next section we formalise this idea by defining the discrepancy function \(δ^{\EMD}_{A}\), where EMD stands for empirical model discrepancy. Crucially, epistemic uncertainty will include contributions from misspecification, which do not vanish when the number of samples becomes infinite. (In contrast, contributions due to having finitely many samples do vanish in that limit, as described by Kiureghian and Ditlevsen [13] and Hüllermeier and Waegeman [14], and as we illustrate in the Supplementary Discussion.)

We will further assume that the data-generating process \(\Mtrue\) is strictly stationary with finite correlations, i.e. that all observations are identically distributed but may be correlated. In fact, for simplicity we treat the samples as i.i.d.; if necessary, this could be ensured by thinning. See Golden [7] and Luengo et al. [37] for discussions on constructing estimators from correlated time series.

The function \(δ^{\EMD}_A\) will be scaled by the aforementioned constant \(c\) to determine the variance of the stochastic process \(\qproc_A\) (which we recall induces the \(R_A\)-distribution for the risk of model \(\M_A\)). The scaling parameter \(c \in \RR_+\) allows a practitioner to adjust the sensitivity of the criterion to misspecification: a larger value of \(c\) will emulate more important experimental variations. Since the tail probabilities (Eq. 2.10) we want to compute will depend on \(c\), we will write them

(2.12)#\[\Bemd{AB;c} \coloneqq P(R_A < R_B \mid c)\,.\]
The EMD rejection rule

For a chosen rejection threshold \(\falsifythreshold \in (0.5, 1]\), reject model \(\M_A\) if there exists a model \(\M_B\) such that \(\Bemd{AB;c} < 1 - \falsifythreshold\) and \(\hat{R}_A > \hat{R}_B\).

The second condition (\(\hat{R}_A > \hat{R}_B\)) ensures the rejection rule remains consistent, even if the \(R\)-distributions become skewed. (See comment below Eq. 2.3.)

As an illustration, Table 2.1 gives the value of \(\Bemd{}\) for each candidate model pair in our example from Fig. 2.1 and Fig. 2.2. As expected, models that were visually assessed to be similar also have \(\Bemd{}\) values close to \(\tfrac{1}{2}\). In practice one would not necessarily need to compute the entire table, since the \(\Bemd{}\) satisfy dice transitivity [38, 39]. In particular this implies that for any threshold \(\falsifythreshold > \varphi^{-2}\) (where \(\varphi\) is the golden ratio), we have

(2.13)#\[\begin{split}\left. \begin{aligned} \Bemd{AB;c} &> \sqrt{\falsifythreshold} \\ \Bemd{BC;c} &> \sqrt{\falsifythreshold} \\ \end{aligned} \,\right\} \,\Rightarrow\, \Bemd{AC;c} > \falsifythreshold \,.\end{split}\]

Since any reasonable choice of threshold will have \(\falsifythreshold > 0.5 > \varphi^{-2}\), we can say that whenever the comparisons \(\Bemd{AB;c}\) and \(\Bemd{BC;c}\) are both greater than \(\sqrt{\falsifythreshold}\), we can treat them as transitive. We give a more general form of this result in the Supplementary Methods.

Table 2.1 Comparison of \(\Bemd{}\) probabilities for candidate LP models. Values are the probabilities given by Eq. 2.12 with the candidate labels \(a\) and \(b\) corresponding to rows and columns respectively. Candidate models being compared are those of Fig. 2.1. Probabilities are computed for the \(R\)-distributions shown in Fig. 2.2, which used an EMD constant of \(c = 2^{-2}\). Since \(P(R_a < R_b) = 1 - P(R_b < R_a)\), the sum of symmetric entries equals 1.#

        A      B      C      D
A     0.500  0.483  0.846  0.821
B     0.517  0.500  0.972  0.940
C     0.154  0.028  0.500  0.463
D     0.179  0.060  0.537  0.500

2.5. \(δ^{\mathrm{EMD}}\): Expressing model-mismatch as a discrepancy between CDFs#

We can treat the loss function for a given model \(\M_A\) as a random variable \(Q(x,y;\M_A)\) where the \((x,y)\) are sampled from \(\Mtrue\). A key realisation for our approach is that the CDF (cumulative distribution function) of the loss suffices to compute the risk. Indeed, we have for the CDF of the loss

(2.14)#\[\begin{split}\begin{aligned} \Phis_A(q) &\coloneqq p\bigl(Q(x,y; \M_A) \leq q \mid x,y \sim \Mtrue \bigr) \notag \\ &= \int_{\X \times \Y} \!\!\! H\bigl(q - Q(x,y; \M_A)\bigr)\,p\bigl(x,y \mid \Mtrue \bigr) \, dx dy \notag \\ &\approx \frac{1}{L} \sum_{x_i,y_i \in \D_\test} H\bigl(q - Q(x_i,y_i; \M_A)\bigr) \,, \end{aligned}\end{split}\]

where \(H\) is the Heaviside function with \(H(q) = 1\) if \(q \geq 0\) and \(0\) otherwise.

Since \(\Mtrue\) is unknown, a crucial feature of Eq. 2.14 is that \(\Phis_A(q)\) can be estimated without needing to evaluate \(p\bigl(x,y \mid \Mtrue \bigr)\); indeed, all that is required is to count the number of observed data points in \(\D_\test\) which according to \(\M_A\) have a loss less than \(q\). Moreover, since \(Q\) returns a scalar, the data points \((x_i, y_i)\) can have any number of dimensions: the number of data points required to get a good estimate of \(\Phis_A(q)\) does not depend on the dimensionality of the data, for the same reason that estimating marginals of a distribution requires far fewer samples than estimating the full distribution. Finally, because the loss is evaluated using \(\M_A\) but the expectation is taken with respect to a distribution determined by \(\Mtrue\), we call \(\Phis_A\) the mixed CDF.

We can invert \(\Phis_A\) to obtain the mixed PPF[1] (percent point function, also known as quantile function, or percentile function):

(2.15)#\[\qs_A(Φ) \coloneqq {\Phis_A}^{-1}(Φ) \,, \qquad(\textit{mixed PPF})\]

which is also a 1-d function, irrespective of the dimensionality of \(\X\) or \(\Y\). We can then rewrite the risk as a one-dimensional integral in \(\qs_A\):

(2.16)#\[\begin{split}\begin{aligned} R_A = R[\qs_A] &= \int_0^1\, \qs_A(Φ) \,dΦ \\ &\approx \frac{1}{L} \sum_{x_i,y_i \in \D_\test} Q(x_i, y_i ; \M_A) \,. \end{aligned}\end{split}\]

To obtain Eq. 2.16, we simply used Fubini’s theorem to reorder the integral of Eq. 2.14 and marginalized over all slices of a given loss \(\qs_A(Φ)\). The integral form (first line of Eq. 2.16) is equivalent to averaging an infinite number of samples, and therefore to the (true) risk, whereas the average over observed samples (second line of Eq. 2.16) is exactly the empirical risk defined in Eq. 2.3.

In practice, to evaluate Eq. 2.16, we use the observed samples to compute the sequence of per-sample losses \(\{Q(x_i,y_i;\M_A)\}_{x_i,y_i \in \D_\test}\), which we use as ordinate values. We then sort this sequence so that we have \(\{Q_i\}_{i=1}^L\) with \(Q_i \leq Q_{i+1}\), and assign to each the abscissa \(Φ_i = i/(L+1)\), such that losses are monotonically increasing and uniformly distributed on the \([0, 1]\) interval. This yields the empirical PPF of the loss – the “empirical” qualifier referring to this construction via samples, as opposed to an analytic calculation. Interpolating the points then yields a continuous function which can be used in further calculations. All examples in this paper linearly interpolate the PPF from \(2^{10}=1024\) points.
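A minimal sketch of this construction (sorting the losses, assigning abscissae \(Φ_i = i/(L+1)\), and linearly interpolating onto \(2^{10}\) points) might read as follows; it illustrates the recipe rather than reproducing our implementation.

```python
import numpy as np

def empirical_ppf(losses, n_points=2**10):
    """Empirical PPF of the loss: sorted per-sample losses placed at
    abscissae Phi_i = i/(L+1), linearly interpolated onto n_points."""
    Q_sorted = np.sort(np.asarray(losses))   # Q_1 <= Q_2 <= ... <= Q_L
    L = len(Q_sorted)
    Phi = np.arange(1, L + 1) / (L + 1)      # uniform abscissae in (0, 1)
    Phi_grid = np.linspace(Phi[0], Phi[-1], n_points)
    return Phi_grid, np.interp(Phi_grid, Phi, Q_sorted)
```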

In Fig. 2.3 we show four examples of empirical PPFs, along with their associated empirical CDFs. We see that the statistics of the additive observational noise affect the shape of the PPF: for noise with exponential tails, as we get from Gaussian or Poisson distributions, we have strong concentration around the minimum value of the loss, followed by a sharp increase at \(Φ=1\). For heavier-tailed distributions like the Cauchy, loss values are less concentrated and the PPF assigns non-negligible probability mass to a wider range of values. The dimensionality of the data also matters. High-dimensional Gaussians are known to place most of their probability mass in a thin shell centred on the mode, and we see this in the fourth column of Fig. 2.3: the sharp increase at \(Φ=0\) indicates that very low probability is assigned to the minimum loss.

Since by construction, the abscissae \(Φ\) of an empirical PPF are spaced at intervals of \(1/L\), the Riemann sum for the integral in Eq. 2.16 reduces to the sample average. More importantly, we can interpret the risk as a functional in \(\qs_A(Φ)\), which will allow us below to define a generic stochastic process that accounts for epistemic uncertainty.

../_images/pedag_CDF_PPF.svg

Fig. 2.3 Loss PPF for different models: each column corresponds to a different model. The PPF (bottom row) is the inverse of the CDF (top row). For calculations we interpolate \(2^{10}=1024\) points (cyan line) to obtain a smooth function; for illustration purposes only 30 points are shown here. The data for the first two columns were generated with the neuron model described at the top of our Results, where the additive noise follows either a Gaussian or a Cauchy distribution. The black body radiation data for the third column were generated from a Poisson distribution using Eq. 2.33 with \(s = 2^{14}\) and \(λ\) in the range \(6\ \mathrm{µm}\text{ to }20\ \mathrm{µm}\). Here the true noise is binomial, but the loss assumes a Gaussian. The fourth column shows an example where the data are high-dimensional; the same 30-dimensional, unit-variance, isotropic Gaussian is used both for generating the data and for evaluating the loss. In all panels the loss function used is the negative log likelihood under the model. [source]#

Up to this point with Eq. 2.16 we have simply rewritten the usual definition of the risk. Recall now that in the previous section, we proposed to equate replication uncertainty with misspecification; specifically we are interested in how differences between the candidate model \(\M_A\) and the true data-generating process \(\Mtrue\) affect the loss PPF, since this determines the risk. Therefore we also compute the PPF of \(Q(x,y; \M_A)\) under its own model (recall from Eq. 1.1 that \(\M_A\) must be a probabilistic model):

(2.17)#\[\begin{split}\begin{aligned} \Phit_A(q) &\coloneqq p\bigl(Q(x,y; \M_A) \leq q \mid x,y \sim \M_A \bigr) \notag \\ &\,= \int_{\X \times \Y} \!\!\!dxdy\; H\bigl(q - Q(x,y; \M_A) \bigr)\,p\bigl(x,y \mid \M_A \bigr) \notag \\ &\approx \frac{1}{L_{\synth,A}} \sum_{x_i,y_i \in \D_{\synth,A}} H\bigl(q - Q(x_i,y_i; \M_A)\bigr) \,, \end{aligned}\end{split}\]

from which we obtain the PPF:

(2.18)#\[\qt_A(Φ) \coloneqq {\Phit_A}^{-1}(Φ) \,. \qquad(\textit{synth PPF})\]

The only difference between \(\qt_A\) and \(\qs_A\) is the use of \(p(x,y \mid \M_A)\) instead of \(p(x,y \mid \Mtrue )\) in the integral. In practice this integral would also be evaluated by sampling, using \(\M_A\) to generate a dataset \(\D_{\synth,A}\) with \(L_{\synth,A}\) samples. Because in this case the candidate model is used for both generating samples and defining the loss, we call \(\qt_A\) (\(\Phit_A\)) the synthetic PPF (CDF).

The idea is that the closer \(\M_A\) is to \(\Mtrue\), the closer also the synthetic PPF should be to the mixed PPF – indeed, equality of the PPFs (\(\qt_A = \qs_A\)) is a necessary condition for equality of the models (\(\M_A = \Mtrue\)). Therefore we can quantify the uncertainty due to misspecification, at least insofar as it affects the risk, as the absolute difference between \(\qt_A\) and \(\qs_A\):

(2.19)#\[\begin{split}\begin{aligned} δ^{\EMD}_A: [0, 1] &\to \RR \\ Φ &\mapsto \bigl\lvert\, \qt_A(Φ) - \qs_A(Φ) \,\bigr\rvert \,. \end{aligned}\end{split}\]

We refer to \(δ^{\EMD}_A\) as the empirical model discrepancy (EMD) function because it measures the discrepancy between two empirical PPFs.

It is worth noting that even a highly stochastic data-generating process \(\Mtrue\), with a lot of aleatoric uncertainty, still has a well-defined PPF – which would be matched by an equally stochastic candidate model \(\M_A\). Therefore the discrepancies measured by \(δ^{\EMD}_A\) are a representation of the epistemic uncertainty. These discrepancies can arise either from a mismatch between \(\M_A\) and \(\Mtrue\), or simply from having too few samples to estimate the mixed PPF \(\qs_A\) exactly; either mechanism contributes to the uncertainty on the expected risk of replicate datasets. We include some additional comments on how different types of uncertainties relate to risk in the Supplementary Discussion.
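Given the two empirical PPFs, Eq. 2.19 reduces to a pointwise absolute difference. The sketch below evaluates both PPFs on a common \(Φ\) grid with `np.quantile`, a simplification of the interpolation-based construction described above; the argument names are hypothetical.

```python
import numpy as np

def delta_emd(mixed_losses, synth_losses, n_points=2**10):
    """EMD discrepancy function (Eq. 2.19) on a common Phi grid.
    mixed_losses: losses of model A on the observed data     -> q*_A (Eq. 2.15)
    synth_losses: losses of model A on its own synthetic data -> q~_A (Eq. 2.18)"""
    Phi = np.linspace(0.0, 1.0, n_points)
    q_star = np.quantile(mixed_losses, Phi)    # mixed PPF
    q_tilde = np.quantile(synth_losses, Phi)   # synthetic PPF
    return Phi, np.abs(q_tilde - q_star)
```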

2.6. \(\qproc\): A stochastic process on quantile functions#

When replication is non-stationary, each replicate dataset will produce a different loss PPF. We also argued above for equating uncertainty across replications with epistemic uncertainty, which in turn determines the discrepancy function \(δ^{\EMD}_A\). It seems natural therefore to expect the following:

For a given model \(\M_A\), differences between its PPFs on two replicate datasets should be proportional to \(δ^{\EMD}_A\).

Formalising this idea is made considerably simpler by the fact that PPFs are always scalar functions, irrespective of the model or dataset. We do this as follows, going through the steps schematically represented by the downward-facing arrows on the right of Fig. 1.1.

For a given model \(\M_A\), we treat the PPFs of different replicates as realisations of a stochastic process \(\qproc_A\) on the interval \([0, 1]\); a realisation of \(\qproc_A\) (i.e. a PPF) is a function \(\qh_A(Φ)\) with \(Φ \in [0, 1]\). The stochastic process \(\qproc_A\) is defined to satisfy the desiderata listed below, which ensure that it is centered on the observed PPF \(\qs_A\) and that its variance at each \(Φ\) is governed by \(δ^{\EMD}_A\).

Obtaining the \(R\)-distributions shown in Fig. 2.2 is then relatively straightforward: We simply sample an ensemble of \(\qh_A\) from \(\qproc_A\) and integrate each to obtain an ensemble of risks.

2.6.1. Desiderata for \(\qproc\)#

In order to interpret them as such, realisations of \({\qh_A \sim \qproc_A}\) must be valid PPFs. This places quite strong constraints on those realisations, for example:

  • All realisations \({\qh_A(Φ) \sim \qproc_A}\) must be monotone.

  • All realisations \({\qh_A(Φ) \sim \qproc_A}\) must be integrable.

Monotonicity follows immediately from the definitions: a CDF is always monotone because it is the integral of a non-negative function (Eq. 2.14), and therefore its inverse must also be monotone.

Integrability simply means that the integral in Eq. 2.16 exists and is finite. Concretely this is enforced by ensuring that the process \(\qproc_A\) is self-consistent [40], a property which we explain in the Methods.

Interpreting the realisations \(\qh\) as PPFs also imposes a third constraint, more subtle but equally important:

  • The process \(\qproc_A\) must be non-accumulating.

A process which is accumulating would start at one end of the domain, say \(Φ=0\), and sequentially accumulate increments until it reaches the other end. Brownian motion over the interval \([0, T]\) is an example of such a process. In contrast, consider the process of constructing a PPF for the data in Fig. 2.1: initially we have few data points and the PPF of their loss is very coarse. As the number of points increases, the PPF gets refined, but since loss values occur in no particular order, this happens simultaneously across the entire interval.

The accumulation of increments strongly influences the statistics of a process; most notably, the variance is usually larger further along the domain. This would not make sense for a PPF: if \(\VV[\qproc_A(Φ)]\) is smaller than \(\VV[\qproc_A(Φ')]\), that should be a consequence of \(δ^{\EMD}(Φ)\) being smaller than \(δ^{\EMD}(Φ')\) – not of \(Φ\) occurring “before” \(Φ'\).

This idea that a realisation \(\qh_A(Φ)\) is generated simultaneously across the interval led us to define \(\qproc_A\) as a sequence of refinements: starting from an initial increment \(Δ\qh_1(0) = \qh_A(1) - \qh_A(0)\) for the entire \(Φ\) interval \([0, 1]\), we partition \([0, 1]\) into \(n\) subintervals, and sample a set of \(n\) subincrements in such a way that they sum to \(Δ\qh_1(0)\). This type of distribution, where \(n\) random variables are drawn under the constraint of a fixed sum, is called a compositional distribution [41]. Note that the constraint reduces the number of dimensions by one, so a pair of increments would be drawn from a 1-d compositional distribution. A typical 1-d example is the beta distribution for \(x_1 \in [0, 1]\), with \(x_2 = (1-x_1)\) and \(α,β > 0\):

(2.20)#\[\begin{aligned} \text{if }\; & x_1 \sim \Beta(α, β) \,, & \text{ then }\; & p(x_1) \propto x_1^{α-1} (1-x_1)^{β-1}\,. \end{aligned}\]

Interestingly, the most natural statistics for compositional distributions are not the mean and variance, but analogue notions of centre and metric variance [41, 42]; for the beta distribution defined above, these are

(2.21)#\[\begin{split}\begin{aligned} \EE_a[(x_1, x_2)] &= \frac{1}{e^{ψ(α)} + e^{ψ(β)}} \bigl(e^{ψ(α)}, e^{ψ(β)}\bigr) \label{eq_comb-stats__EE} \\ \Mvar[(x_1, x_2)] &= \frac{1}{2} \bigl(ψ_1(α) + ψ_1(β)\bigr) \,, \label{eq_comb-stats__Mvar} \end{aligned}\end{split}\]

where \(ψ\) and \(ψ_1\) are the digamma and trigamma functions respectively, and \(\EE_a\) denotes the expectation with respect to the Aitchison measure [41, 43]. In essence, Eq. 2.21 is obtained by mapping \(x_1\) and \(x_2\) to the unbounded domain \(\RR\) via a logistic transformation, then evaluating moments of the unbounded variables. Of particular relevance is that – in contrast to the variance – the metric variance \(\Mvar\) of a compositional distribution is unbounded, which simplifies the selection of \(α\) and \(β\) (see Choosing beta distribution parameters in the Methods).

Of course, we not only want the \(\qh_A\) to be valid PPFs, but also descriptive of the model and data. We express this as two additional constraints, which together define a notion of closeness to \(\qs_A\):

  • At each intermediate point \(Φ \in (0, 1)\), the centre of the process is given by \(\qs_A(Φ)\):

(2.22)#\[\EE_a\bigl[ \qh(Φ) \bigr] = \qs_A(Φ) \,.\]
  • At each intermediate point \(Φ \in (0, 1)\), the metric variance is proportional to the square of \(δ^{\EMD}_A(Φ)\) :

(2.23)#\[\Mvar\bigl[ \qh(Φ) \bigr] = c \, δ^{\EMD}_A(Φ)^2 \,,\]

where \(c > 0\) is the aforementioned sensitivity parameter (also further described in Calibrating and validating the BEMD).

In addition, for reasons of convenience, we also ask that

  • The end points are sampled from Gaussian distributions:

(2.24)#\[\begin{split}\begin{aligned} \qh(0) \sim \nN(\qs_A(0), c \, δ^{\EMD}_A(0)^2) \,, \\ \qh(1) \sim \nN(\qs_A(1), c \, δ^{\EMD}_A(1)^2) \,. \end{aligned}\end{split}\]

Thus the process \(\qproc_{A;c}\) should be parametrised by two functions and a scalar: \(\qs_A\), \(δ^{\EMD}_A\) and \(c\). It should be molded to produce realisations \(\qh\) which as a whole track \(\qs_A\), with more variability between realisations at points \(Φ\) where \(δ^{\EMD}_A(Φ)\) is larger (middle panel of Fig. 2.4a). This tracking however may be imperfect, because we must also satisfy the constraints of monotonicity and non-accumulating increments (lower panel of Fig. 2.4a).

To the best of our knowledge the current literature does not provide a process satisfying all of these constraints. To remedy this situation, we propose a new hierarchical beta (HB) process, which we illustrate in Fig. 2.4. A few example realisations of \({\qh \sim \qproc_A}\) are drawn as grey lines in Fig. 2.4a. The mixed PPF \(\qs_A\) (Eq. 2.22) is drawn as a green line, while the region corresponding to \(\qs_A(Φ) \pm \sqrt{c} δ^{\EMD}_A(Φ)\) is shaded in yellow.

Fig. 2.4b shows distributions of \(\qh(Φ)\) at three different values of \(Φ\). The value of \(\qs_A(Φ)\) is indicated by the green vertical bar and agrees well with \(\EE_a\bigl[ \qh(Φ) \bigr]\); the desideratum of Eq. 2.22 is therefore satisfied. The scaling of these distributions with \(δ^{\EMD}_A\) (Eq. 2.23) is however only approximate, which we can see as the yellow shading not having the same width in each panel. This is a result of the tension with the other constraints: the HB process ensures that the monotonicity and integrability constraints are satisfied exactly, but allows deviations in the statistical constraints.

A realisation of an HB process is obtained by a sequence of refinements; we illustrate three such refinement steps in Fig. 2.4d. The basic idea is to refine an increment \(Δ\qh_{ΔΦ}(Φ) := \qh(Φ+ΔΦ) - \qh(Φ)\) into two subincrements \(Δ\qh_{\tfrac{ΔΦ}{2}}(Φ)\) and \(Δ\qh_{\tfrac{ΔΦ}{2}}(Φ+\tfrac{ΔΦ}{2})\), with \(Δ\qh_{\tfrac{ΔΦ}{2}}(Φ) + Δ\qh_{\tfrac{ΔΦ}{2}}(Φ+\tfrac{ΔΦ}{2}) = Δ\qh_{ΔΦ}(Φ)\). To do this, we first determine appropriate parameters \(α\) and \(β\), draw \(x_1\) from the corresponding beta distribution and assign

(2.25)#\[\begin{split}\begin{aligned} Δ\qh_{\tfrac{ΔΦ}{2}}(Φ) &\leftarrow x_1 Δ\qh_{ΔΦ}(Φ) \,, \\ Δ\qh_{\tfrac{ΔΦ}{2}}(Φ+\tfrac{ΔΦ}{2}) &\leftarrow x_2 Δ\qh_{ΔΦ}(Φ) \,, \end{aligned}\end{split}\]

where again \(x_2 = (1-x_1)\). Fig. 2.4c shows distributions for the first (orange) and second (blue) subincrements at the fourth refinement step, where we divide an increment over an interval of length \(ΔΦ=2^{-3}\) into two subincrements over intervals of length \(2^{-4}\). Each pair of subincrements is drawn from a different distribution, which depends on the particular realisation \(\qh\), but we can nevertheless see the PPF reflected in the aggregate distribution: the PPF has positive curvature, so the second subincrement tends to be larger than the first. Both subincrements are also bounded from below by 0, to ensure monotonicity.

A complete description of the HB process, including a procedure for choosing the beta parameters \(α\) and \(β\) such that our desiderata are satisfied, is given in the Methods.
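To convey the flavour of the refinement scheme of Eq. 2.25, here is a deliberately simplified sketch: it bisects increments recursively and splits each one with a beta draw. The fixed `alpha` and `beta` are placeholders – the actual HB process chooses them per increment (see the Methods) so that Eq. 2.22 and Eq. 2.23 hold – and the end-point values `q0`, `q1` would themselves be drawn according to Eq. 2.24.

```python
import numpy as np

rng = np.random.default_rng(0)

def hb_realisation(q0, q1, n_steps=8, alpha=2.0, beta=2.0):
    """One realisation q^(Phi) of a hierarchical-beta-style process.
    Starts from the end points q^(0)=q0, q^(1)=q1 and repeatedly splits
    every increment into two beta-distributed subincrements (Eq. 2.25).
    NOTE: fixed (alpha, beta) is a placeholder for the per-increment
    choice described in the Methods."""
    q = np.array([q0, q1], dtype=float)
    for _ in range(n_steps):
        increments = np.diff(q)                       # current Delta q's
        x1 = rng.beta(alpha, beta, size=len(increments))
        # Interleave the two subincrements of each increment; both are
        # non-negative, so monotonicity is preserved.
        sub = np.column_stack([x1 * increments,
                               (1 - x1) * increments]).ravel()
        q = q[0] + np.concatenate([[0.0], np.cumsum(sub)])
    return np.linspace(0, 1, len(q)), q
```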

2.6.2. Using \(\qproc\) to compare models#

Having constructed a process \(\qproc_{A;c}\) for a candidate model \(\M_A\), we can use it to induce a distribution on risks. We do this by generating a sequence of PPFs \(\qh_{A,1}, \qh_{A,2}, \dotsc, \qh_{A,M_A}\), where \(M_A \in \NN\) and each \(\qh_{A,i}\) is drawn from \(\qproc_{A;c}\) (see Fig. 2.4 for examples of sampled \(\qh_{A,i}\), and the Methods for more details on how we evaluate \(\Bemd{}\)). As we explain in the next section, the sensitivity parameter \(c\) is a property of the experiment; it is the same for all candidate models.

For each generated PPF, we evaluate the risk functional (using the integral form of Eq. 2.16), thus obtaining a sequence of scalars \(R\bigl[\qh_{A,1}\bigr], R\bigl[\qh_{A,2}\bigr], \dotsc, R\bigl[\qh_{A,M_A}\bigr]\) which follows \(p(R_A \mid \qproc_A)\). With \(M_A\) sufficiently large, these samples accurately characterize the distribution \(p(R_A \mid \qproc_A)\) (we use \(\leftrightarrow\) to relate equivalent descriptions):

(2.26)#\[\begin{split}\begin{multline} R_A \sim p(R_A \mid \D_\test, \M_A; c) \\ \begin{aligned} &\leftrightarrow \Bigl\{R\bigl[\qh_{A,1}\bigr], R\bigl[\qh_{A,2}\bigr], \dotsc, R\bigl[\qh_{A,M_A}\bigr] \Bigm| \qh_A \sim \qproc_A \Bigr\} \\ &\leftrightarrow \Bigl\{R_{A,1}\,,\;\; R_{A,2}\,,\;\; \dotsc, R_{A,M_A}\Bigr\} \,. \end{aligned} \end{multline}\end{split}\]

Repeating this procedure for a different model \(\M_B\) yields a different distribution for the risk:

\[R_B \sim p(R_B \mid \D_\test, \M_B; c) \;\leftrightarrow\; \Bigl\{R_{B,1}\,, R_{B,2}\,, \dotsc, R_{B,M_B}\Bigr\} \,.\]

The \(\Bemd{AB;c}\) criterion Eq. 2.12 then reduces to a double sum:

(2.27)#\[\begin{split}\begin{aligned} \Bemd{AB;c} &\coloneqq P(R_A < R_B \mid c) \\ &\approx \frac{1}{M_A M_B} \sum_{i=1}^{M_A} \sum_{j=1}^{M_B} \boldsymbol{1}_{R_{A,i} < R_{B,j}}\,. \end{aligned}\end{split}\]

In Eq. 2.27, the term within the sum is one when \({R_{A,i} < R_{B,j}}\) and zero otherwise. A value of \(\Bemd{AB;c}\) greater (less) than 0.5 indicates evidence for (against) model \(\M_A\).
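In code, Eq. 2.27 amounts to comparing every sampled risk of \(\M_A\) against every sampled risk of \(\M_B\). The sketch below assumes arrays of risk samples, obtained by integrating PPF draws as in Eq. 2.26.

```python
import numpy as np

def B_emd(R_A_samples, R_B_samples):
    """Double-sum estimate of B^EMD_{AB;c} = P(R_A < R_B) (Eq. 2.27)."""
    R_A = np.asarray(R_A_samples)[:, None]   # shape (M_A, 1)
    R_B = np.asarray(R_B_samples)[None, :]   # shape (1, M_B)
    return np.mean(R_A < R_B)

# Risk samples come from integrating PPF draws (Eq. 2.16); on a uniform
# Phi grid the integral reduces to a mean over the grid:
# R_A_samples = np.array([q_hat.mean() for q_hat in ppf_draws_A])
```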

We view the undetermined parameter \(c\) as a way to adjust the sensitivity of the criterion: larger values of \(c\) will typically lead to broader distributions of \(R_A\), and therefore lead to a more conservative criterion (i.e. one which is more likely to result in an equivocal outcome). We give some guidelines on choosing \(c\) as part of the calibration procedure described below.

../_images/pedag_qhat.svg

Fig. 2.4 Sampling a hierarchical beta (HB) process. (a) PPF samples (grey lines) drawn from a hierarchical process \(\qproc\). Mixed (green) and synthetic (red) PPFs are those for the Planck model of Fig. 2.7f (respectively \(\qs\) from Eq. 2.15 and \(\qt\) from Eq. 2.18). Lower panels are enlarged portions of the top panel, corresponding to the black rectangles in the latter. At each \(Φ\), the variance between realisations is controlled by \(\sqrt{c}\, δ^{\EMD}\) (yellow shading), per Eq. 2.23; here we use \(c=0.5\). (b) Marginals of \(\qproc\) at three values of \(Φ\), obtained as histograms of 10,000 realisations \(\qh\). As in (a), the green line indicates the value of \(\qs(Φ)\) and the yellow shading describes the range \(\qs(Φ) \pm \sqrt{c}δ^{\EMD}(Φ)\). Values of \(Φ\) are given alongside the panels and drawn as cyan vertical lines in (a). (c) Distributions of the subincrements drawn at the same three \(Φ\) positions as in (b), obtained as histograms from the same 10,000 realisations. Subincrements are for the fourth refinement step, corresponding to the fourth panel of (d). Notation in the legend corresponds to Eq. 2.25. (d) Illustration of the refinement process. This \(\qproc\) is parametrised by the same PPFs as the one in (a–c), but we used a larger \(c\) (16) to facilitate visualization. Each refinement step halves the width \(ΔΦ\) of an increment; here we show four refinement steps, while in most calculations we use eight. New points at each step are coloured green; the beta distribution from which each new point is drawn is shown in grey. The domain of each of these distributions spans the vertical distance between the two neighbouring points and is shown as a grey vertical line. [source]#

2.7. Calibrating and validating the BEMD#

The process \(\qproc\) constructed in the previous section succeeds in producing distributions of the risk, and at least for some values of the scaling constant \(c\), those distributions look as we expect (Fig. 2.2) and define a \(\Bemd{}\) criterion which assigns reasonable tail probabilities (Table 2.1).

To now put the \(\Bemd{}\) criterion on firmer footing, recall that it is meant to model variations between data replications. We can therefore validate the criterion by simulating such variations in silico (in effect simulating many replications of the experiment): since the true data-generating process \(\Mtrue\) is then known, this gives us a ground truth of known comparison outcomes, against which we can test the probabilistic prediction made by \(\Bemd{}\). Because this procedure will also serve to select a suitable value for the scaling constant \(c\), we call it a calibration experiment.

To generate the required variations we use epistemic distributions of the type introduced previously, of which we already gave an example in Eq. 2.9. With this we define an alternative tail probability (see the Methods for details)

(2.28)#\[\Bconf{AB;Ω} \coloneqq P(R_A < R_B \mid Ω) \,,\]

which uses the epistemic distribution \(Ω\) instead of the process \(\qproc_A\) to represent epistemic uncertainty. We then look for \(c > 0\) such that the criterion \(\Bemd{AB;c}\) a) is correlated with \(\Bconf{AB;Ω}\); and b) satisfies

(2.29)#\[\Bigl\lvert \Bemd{AB;c} - 0.5 \Bigr\rvert \leq \Bigl\lvert \Bconf{AB;Ω} - 0.5 \Bigr\rvert \,.\]

Eq. 2.29 encodes a desire for a conservative criterion: we are most worried about incorrectly rejecting a model, i.e. making a type I error. The consequences of a type II error (failing to reject either model) are in general less dire: it is an indication that we need better data to differentiate between the models. This desire for a conservative criterion reflects a broad belief in science that it is better to overestimate errors than to underestimate them.

The advantage of a probability like \(\Bconf{AB;Ω}\) is that it makes explicit which kinds of replication variations are involved in computing the tail probability; the interpretability of Eq. 2.28 is therefore clear. However, it can only be computed after generating a large number of synthetic datasets, which requires a known and parametrisable data-generating process \(\Mtrue\); on its own, therefore, \(\Bconf{AB;Ω}\) cannot be used directly as a criterion for comparing models. By choosing \(c\) such that the criteria are correlated, we transfer the interpretability of \(\Bconf{AB;Ω}\) onto \(\Bemd{AB;c}\). Moreover, if we can find such a \(c\), then we have de facto validated \(\Bemd{AB;c}\) for the set of simulated experiments described by \(Ω\).

Since defining an epistemic distribution involves making many arbitrary choices, one may want to define multiple distributions \(Ω_1, Ω_2, \dotsc\) to ensure that results are not sensitive to any particular choice of \(Ω\). Eq. 2.29 is easily generalised to account for this (following a logic similar to that of Eq. 2.10), in which case it becomes

(2.30)#\[\Bigl\lvert \Bemd{AB;c} - 0.5 \Bigr\rvert \leq \min_{Ω \in \{Ω_1, Ω_2, \dotsc\}\;} \Bigl\lvert \Bconf{AB;Ω} - 0.5 \Bigr\rvert \,.\]
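Once the tail probabilities have been computed, Eq. 2.29 and its generalisation Eq. 2.30 are simple numerical checks; the values below are hypothetical and serve only to fix ideas:

```python
def is_conservative(B_emd, B_conf_values):
    """Check Eq. 2.30: the EMD criterion should be no more decisive
    than the most equivocal of the epistemic distributions."""
    return abs(B_emd - 0.5) <= min(abs(B - 0.5) for B in B_conf_values)

# Hypothetical tail probabilities for three epistemic distributions:
print(is_conservative(0.62, [0.71, 0.65, 0.80]))  # True: 0.12 <= 0.15
```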

We found that an effective way to verify Eq. 2.29 is to plot \(\Bconf{AB;Ω}\) against \(\Bemd{AB;c}\), where values of \(\Bconf{AB;Ω}\) are obtained by averaging comparison outcomes conditioned on \(\Bemd{AB;c}\) falling within an interval. We thus obtain a histogram of \(\Bconf{AB;Ω}\) against \(\Bemd{AB;c}\), which works best when the bin sizes are adjusted so that all bins have similar statistical power. We illustrate this in Fig. 2.5, where histograms are drawn as curves to facilitate interpretation. These curves are drawn against the “overconfident regions” (where Eq. 2.29 is violated), depicted in red or yellow: we therefore look for values of \(c\) for which the curves stay within the white regions as much as possible. The visual representation makes it easier to judge the extent to which small violations of Eq. 2.29 can be tolerated.
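A minimal sketch of this binning procedure, assuming each simulated experiment yields a \(\Bemd{}\) value and a binary outcome (1 if model A achieved the lower true risk), could look as follows; the function name and equal-count binning rule are our illustrative choices:

```python
import numpy as np

def calibration_curve(b_emd, outcomes, n_bins=16):
    """Average comparison outcomes conditioned on B_EMD falling in a bin,
    so each bin estimates B_conf given B_EMD.  Equal-count bins give all
    points on the curve similar statistical power."""
    order = np.argsort(b_emd)
    bins = np.array_split(order, n_bins)                  # equal-count bins
    x = np.array([b_emd[idx].mean() for idx in bins])     # bin centres
    y = np.array([outcomes[idx].mean() for idx in bins])  # estimated B_conf
    return x, y
```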

../_images/prinz_calibrations.svg

Fig. 2.5 Calibration curves for the LP models of Fig. 2.1. Calibration curves for six epistemic distributions, computed following our proposed calibration procedure. Each curve summarizes 512 simulated experiments with datasets of size 4000, and the six panels explore how curves depend on three parameters: the pair of models being compared, the constant \(c\) and the input strength \(I_{\mathrm{ext}}\). The latter is either \(\texttt{weak}\) (top) or \(\texttt{strong}\) (bottom). Other parameters are kept fixed, namely a \(\texttt{short}\) correlation time \(τ\) and a \(\texttt{Gaussian}\) observation noise with \(\texttt{low}\) standard deviation. (For an extended version where all conditions are tested, see Fig. 6.1.) The regions depicted in red and yellow are those where Eq. 2.29 is violated. [source]#

Fig. 2.5 shows calibration curves for six different epistemic distributions: three model pairs (\(\M_A \,\mathtt{vs}\, \M_B\), \(\M_C \,\mathtt{vs}\, \M_D\), and \(\M_A \,\mathtt{vs}\, \M_D\)), each with \(\mathtt{weak}\) and \(\mathtt{strong}\) external input \(I_{\mathrm{ext}}\). (For full details on the choice of epistemic distribution, see the Methods.) The model pairs were chosen to test three different situations: one where the candidate models are similar both to each other and the observations (\(\M_A \,\mathtt{vs}\, \M_B\)), one where the candidate models are similar to each other but different from the observations (\(\M_C \,\mathtt{vs}\, \M_D\)), and one where only one of the candidates is similar to the observations (\(\M_A \,\mathtt{vs}\, \M_D\)). For low to moderate \(c\), we see that \(\Bconf{}\) strongly correlates with \(\Bemd{}\), which confirms that \(\Bemd{ab;c}\) can be an estimator for the probability \(P(R_a < R_b)\) (recall Eq. 2.12 and Eq. 2.28). In this case we chose the value \(c = 2^{-2}\) since it avoids the overconfidence (red and yellow) regions under most conditions.

It is worth noting that this calibration procedure works best between similar models: we want to probe the full range of possible values for \(\Bemd{}\) and \(\Bconf{}\), both of which take values in \((0, 1)\). In particular this requires that for a subset of the simulated experiments, the risk should be similar under both models.

Also worth noting is that there is an upper limit to the value we can choose for \(c\), as evidenced by the curves reversing in Fig. 2.5 when we set \(c=2^4\). This can be understood as the monotonicity constraint placing an upper bound on the achievable metric variance of \(\qproc\).

2.8. Characterizing the behaviour of \(R\)-distributions#

To better anchor the interpretability of \(R\)-distributions, in this section we perform a more systematic study of the relationship between epistemic uncertainty, aleatoric uncertainty, and the shape of the \(R\)-distributions. To do this we use a different example, chosen for its illustrative simplicity, which allows us to independently adjust the ambiguity (how much two models are qualitatively similar) and the level of observation noise.

Concretely, we imagine a fictitious historical scenario where the Rayleigh-Jeans

(2.31)#\[\Bspec_{\mathrm{RJ}}(λ; T) = \frac{2 c k_B T}{λ^4}\]

and Planck

(2.32)#\[\Bspec_\mathrm{P}(λ; T) = \frac{2 h c^2}{λ^5} \frac{1}{\exp\left( \frac{hc}{λ k_B T} \right) - 1}\]

models for the radiance of a black body are two candidate models given equal weight in the scientific community. They stem from different theories of statistical physics, but both agree with observations at infrared or longer wavelengths (Fig. 2.6), so both are plausible if observations are limited to that window. (When the window is extended to shorter wavelengths, the predictions diverge and it becomes clear that the Planck model is the correct one.)

Remark

For our purposes, these are just two models describing the relationship between an independent variable \(λ\) (the wavelength) and a dependent variable \(\Bspec\) (the spectral radiance), given a parameter \(T\) (the temperature) which is inferred from data; our discussion is agnostic to the underlying physics. The parameters \(h\), \(c\) and \(k_B\) are known physical constants (the Planck constant, the speed of light and the Boltzmann constant) and can be omitted from the discussion.
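In code, the two candidate radiance models reduce to a few lines; the sketch below uses SciPy's physical constants and serves only to fix notation:

```python
import numpy as np
from scipy.constants import h, c, k  # Planck constant, speed of light, Boltzmann constant

def B_RJ(lam, T):
    """Rayleigh-Jeans spectral radiance, Eq. 2.31."""
    return 2 * c * k * T / lam**4

def B_P(lam, T):
    """Planck spectral radiance, Eq. 2.32 (expm1 for numerical stability)."""
    return (2 * h * c**2 / lam**5) / np.expm1(h * c / (lam * k * T))

lam = np.linspace(15e-6, 30e-6, 100)  # long (infrared) wavelengths, in metres
# Here B_RJ(lam, 3000) and B_P(lam, 3000) nearly coincide;
# at visible wavelengths they diverge strongly.
```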

We use a simple Poisson counting process for the data-generating model \(\Mtrue\), including the observation noise:

(2.33)#\[\begin{aligned} \Bspec \mid λ, T, s &\sim \frac{1}{s}\Poisson\bigl(s \, \Bspec_{\mathrm{P}}(λ; T)\bigr) + \Bspec_0 \,, \end{aligned}\]

where \(s\) is a parameter related to the gain of the detector (see the Methods for details). Most relevant to the subsequent discussion is that the mean and variance of \(\Bspec\) are

\[\begin{split}\begin{aligned} \EE[\Bspec] &= \Bspec_{\mathrm{P}}(λ; T) + \Bspec_0 \,, \\ \VV[\Bspec] &= \frac{\Bspec_{\mathrm{P}}(λ; T)}{s} \,, \end{aligned}\end{split}\]

and can therefore be independently controlled with the parameters \(\Bspec_0\) and \(s\).
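A minimal sampler for this data-generating process (reusing the B_P function from the sketch above) makes the roles of the two parameters explicit:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_Mtrue(lam, T, s, B0):
    """Draw observations from Eq. 2.33.
    Mean: B_P(lam, T) + B0 (controlled by B0);
    variance: B_P(lam, T) / s (controlled by s)."""
    return rng.poisson(s * B_P(lam, T)) / s + B0
```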

For the purposes of this example, both candidate models \(\MRJ\) and \(\MP\) make the incorrect (but common) assumption of additive Gaussian noise, such that instead of Eq. 2.33 they assume

(2.34)#\[\begin{aligned} \Bspec \mid λ, T, σ &\sim \nN\bigl(\Bspec_{a}(λ; T), σ^2\bigr) \,, \end{aligned}\]

with \(\Bspec_a \in \{\Bspec_{\mathrm{RJ}}, \Bspec_{\mathrm{P}}\}\) and \(σ > 0\). This ensures that there is always some amount of mismatch between \(\Mtrue\) and the two candidates. That mismatch is increased when \(\Bspec_0 > 0\), which we interpret as a sensor bias which the candidate models neglect.
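Under Eq. 2.34, the per-sample loss \(Q\) used by both candidates is the Gaussian negative log likelihood; a generic sketch (naming is ours) is:

```python
import numpy as np

def Q_gauss(lam, B_obs, B_model, T, sigma):
    """Negative log likelihood of one observation under the (misspecified)
    additive-Gaussian assumption of Eq. 2.34; B_model is B_RJ or B_P."""
    resid = B_obs - B_model(lam, T)
    return 0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2)
```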

With this setup, we have four parameters which move the problem along three different “axes”: The parameters \(λ_{\mathrm{min}}\) and \(λ_{\mathrm{max}}\) determine the spectrometer’s detection window, and thereby the ambiguity: the shorter the wavelength, the easier it is to distinguish the two models. The parameter \(s\) determines the level of noise. The parameter \(\Bspec_0\) determines an additional amount of misspecification between the candidate model and the data.

We explore these three axes in Fig. 2.7, and illustrate how the overlap of the \(R_{\mathrm{P}}\) and \(R_{\mathrm{RJ}}\) distributions changes through mainly two mechanisms: Better data can shift one \(R\)-distribution more than the other, and/or it can tighten one or both of the \(R\)-distributions. Either of these effects can increase the separability of the two distributions (and therefore the strength of the evidence for rejecting one of them).

../_images/uv_dataset-and-Rdist-grids.svg

Fig. 2.7 \(R\)-distributions (right) for different simulated datasets of spectral radiance (left). Datasets were generated using Eq. 2.33. For each model \(a\), the \(R_a\)-distribution was obtained by sampling an HB process \(\qproc\) parametrised by \(δ^{\EMD}_a\); see the respective Results sections for definitions of \(\qproc\) and \(δ^{\EMD}\). A kernel density estimate is used to visualise the resulting samples (Eq. 2.26) as densities. In all rows, noise (\(s\)) decreases left to right.#

Top row: Over a range of long wavelengths with positive bias \(\Bspec_0 = 0.0015\), both models fit the data equally well. Middle row: Same noise levels as the first row, but now the bias is zero. The Planck model is now very close to \(\Mtrue\), and consequently has nearly-Dirac \(R\)-distributions and lower expected risk. Bottom row: At visible wavelengths, the better fit of the Planck model is incontrovertible. [source]

2.9. Different criteria embody different notions of robustness#

As a final step, we now wish to locate our proposed method within the wider constellation of model selection methods.

The motivation behind any model selection criterion is to make the selection in some way robust – i.e. to encourage a choice which is good not just for the given particular data and models, but also for variations thereof. Otherwise we would just select the model with the lowest empirical risk and be done with it. Criteria vary widely however in terms of what kinds of variations they account for. For the purposes of comparing with the EMD rejection rule, we consider three types:

In-distribution variations of the data

where new samples are drawn from the same data-generating process \(\Mtrue\) as the data.

Out-of-distribution variations of the data

where new samples are drawn from a data-generating process \(\Mtrue'\) which differs from the original process \(\Mtrue\). How much \(\Mtrue'\) is allowed to differ from \(\Mtrue\) will depend on the target application.

Variations of model parameters,

for example by sampling from a Bayesian posterior, or refitting the model to new data.

Robustness to data variations, whether in- or out-of-distribution, is generally considered a positive. In contrast, robustness to model variations (what might more commonly be described as insensitivity to model parameters) is often attributed to excess model complexity and thus considered a negative.

Exactly which type of variation a criterion accounts for – and just as importantly, how it does so – defines what we might call the paradigm for that criterion. In Fig. 2.8 we compare the EMD approach to five other commonly used criteria with a variety of model selection paradigms; we expand on those differences below.

For our purposes the most important of those differences is how a criterion accounts for epistemic uncertainty. Four of the considered criteria do this by comparing parametrised families of models – either explicitly by averaging their performance over a prior (BIC, \(\log \eE\)), or implicitly by refitting the model to each new dataset (MDL, AIC). In other words, such criteria compare the mathematical structure of different models, rather than specific parameters; they would not be suited to the neural circuit example of earlier sections, where the candidate models were solely distinguished by their parameters. The comparison we do here is therefore only possible when the models being compared are structurally distinct.

Our comparison is also tailored to highlight features of particular importance for experimental data and which differentiate our method. It is by no means exhaustive, and many aspects of model comparison are omitted – such as their consistency or how they account for informative priors. The example data were also chosen to be in a regime favourable to all models; this is partly why many panels show similar behaviour.

../_images/compare-other-criteria-asymptotics.svg

Fig. 2.8 Behaviour of model selection criteria as a function of dataset size. Datasets were generated using Eq. 2.33; they differ only in their number of samples and random seed; an example is shown in the top right. The comparison outcome between two models is given by the difference between their criteria; these are plotted as differences w.r.t. an arbitrary reference model (blue curve). The log evidence (\(\log \eE\)) and \(\elpd\) provide uncertainties, shown either as shading or as error bars. MDL curves are shorter because computations exceeding 100 hours were aborted.#

We can broadly classify criteria according to whether they select models based solely on their predictive accuracy, or whether they also penalize their complexity (so as to prefer simpler models).

Let us begin with the latter. The most common of these is likely the Bayes factor, formally defined as the ratio of model probabilities and in practice computed as the ratio of marginal likelihoods, or model evidences [30, 44, 45]:

(2.35)#\[B^{\text{Bayes}}_{AB} = \frac{p(\M_A \mid \D)} {p(\M_B \mid \D)} = \frac{\int p(\D \mid θ, \M_A) π_A(θ) dθ} {\int p(\D \mid θ, \M_B) π_B(θ) dθ} \eqqcolon \frac{\eE_A} {\eE_B} \,.\]

Here \(p(\D \mid θ, \M_A)\) and \(π_A\) are likelihood and prior distributions respectively. Going forward, instead of Bayes factors we will consider each model’s log evidence: the former are easily recovered from the difference of the latter,

\[\log B^{\text{Bayes}}_{AB} = \log \eE_A - \log \eE_B \,,\]

and this makes it easier to compare multiple models at once. It also matches the conventional format of information criteria. Numerical evaluation of the model evidence is non-trivial, but generic software solutions are available. For this comparison we used an implementation of nested sampling [46] provided by the Python package dynesty [47].
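For concreteness, here is a toy nested-sampling run with dynesty for a one-parameter Planck fit; the synthetic data, noise level and prior range are illustrative assumptions, not the configuration used for Fig. 2.8:

```python
import numpy as np
import dynesty
from scipy.constants import h, c, k

rng = np.random.default_rng(0)
lam = np.linspace(15e-6, 30e-6, 50)

def B_P(lam, T):
    return (2 * h * c**2 / lam**5) / np.expm1(h * c / (lam * k * T))

sigma = 0.05 * B_P(lam, 3000.0).mean()          # assumed noise level
B_obs = B_P(lam, 3000.0) + sigma * rng.standard_normal(lam.size)

def loglike(theta):
    resid = B_obs - B_P(lam, theta[0])
    return -0.5 * np.sum(resid**2 / sigma**2 + np.log(2 * np.pi * sigma**2))

def prior_transform(u):
    return np.array([1000.0 + 4000.0 * u[0]])   # uniform prior: T in [1000, 5000] K

sampler = dynesty.NestedSampler(loglike, prior_transform, ndim=1)
sampler.run_nested(print_progress=False)
log_evidence = sampler.results.logz[-1]         # final log E estimate
```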

The model evidence corresponds to studying variations of the model (represented by the prior) for a fixed dataset \(\D\). A model is penalised if its prior is unnecessarily broad or high-dimensional; this is the so-called “Occam’s razor” principle which favours simpler models [44, 45]. This sometimes leads to undesirable results when priors are non-informative [22, §6].

The Bayesian information criterion (BIC) is an asymptotic approximation of the log evidence, either around the maximum likelihood or maximum a posteriori estimate [30, 34]. It will approach the log evidence (\(\log \eE\)) when the likelihood is non-singular and approximately Gaussian around its maximum.
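The BIC itself is a one-line statistic; with \(n\) the number of samples and \(k\) the number of parameters:

```python
import numpy as np

def bic(max_loglik, k, n):
    """BIC = k ln n - 2 ln L_max; under the stated regularity conditions,
    -BIC/2 approximates the log evidence."""
    return k * np.log(n) - 2 * max_loglik
```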

Another approach which penalizes complexity is the minimum description length (MDL). This is more accurately described as a methodological framework built around the choice of a “universal probability distribution” [48]; usually, however, the normalised maximum likelihood (NML) distribution is implied, and this is what we use here. Assuming equal prior preference for each model, MDL selects the one with the highest NML probability; equivalently, it computes for each model the quantity

(2.36)#\[\MDL_A \coloneqq -\log \max_θ p(\D\mid\M_A(θ)) + \comp(\M_A) \,,\]

and selects the model for which \(\MDL_A\) is minimal. (It is assumed here that \(\M_A\) is parametrised, and \(\M_A(θ)\) denotes the model \(\M_A\) with its parameters set to the values \(θ\).) The first term is the negative logarithm of the likelihood maximized on the training data. The second term is the MDL complexity, defined as a logarithmic integral over the entire event space:

(2.37)#\[\comp(\M_A) \coloneqq \log \int \max_θ p(\D'\mid\M_A(θ)) \,d\D' \,.\]

In other words, MDL complexity quantifies the ability of a model to fit arbitrary data. Note that it does not matter here what data-generating process \(\Mtrue\) generated the data since each possible dataset is given equal weight. Approximations are almost always required to compute Eq. 2.37 since the integral is usually intractable; for this comparison, we chose a data regime where it is just about possible to enumerate the possible datasets and estimate \(\comp(\M_A)\) directly with Monte Carlo.
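To make the complexity term concrete, consider the textbook case of \(n\) Bernoulli trials, where the event space is small enough to sum exactly; this example is ours and unrelated to the spectral models used elsewhere:

```python
import numpy as np
from scipy.special import gammaln

def bernoulli_comp(n):
    """comp = log sum over all 2^n datasets of the maximized likelihood.
    Sequences with k successes share the maximized likelihood
    (k/n)^k (1-k/n)^(n-k), so the sum reduces to a sum over k."""
    ks = np.arange(n + 1)
    log_binom = gammaln(n + 1) - gammaln(ks + 1) - gammaln(n - ks + 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ml = (np.where(ks == 0, 0.0, ks * np.log(ks / n))
                  + np.where(ks == n, 0.0, (n - ks) * np.log(1 - ks / n)))
    return np.log(np.sum(np.exp(log_binom + log_ml)))

print(bernoulli_comp(100))  # grows roughly like (1/2) log n
```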

With MDL therefore we consider variations of the models, since each model is refitted to each dataset \(\D'\). We have no in-distribution dataset variations (the true process \(\Mtrue\) is ignored), but we do have out-of-distribution variations in the form of the integral over the event space.

Moving on now to criteria which compare predictive accuracy, we start with the expected log pointwise predictive density (elpd) [31]. This is defined as the expectation of the loss \(Q\) for \(L\) samples drawn from the data-generating process:

(2.38)#\[\begin{align} \label{eq_def_elpd} \elpd_A &:= \sum_{i=1}^L \EE_{(x_i, y_i) \sim \Mtrue}\bigl[Q(x_i, y_i; \M_A)\bigr] \\ &= L\,R_A \,. \notag \end{align}\]

The most common form of the \(\elpd\) uses the (negative) log likelihood as the loss function \(Q\). In this form, where each sample is drawn from the same distribution, the \(\elpd\) is proportional to the risk \(R_A\) defined in Eq. 2.2, and is therefore conceptually analogous: it describes robustness to in-distribution variations of the dataset. The models are kept fixed in these comparisons; they are not refit to the new data. The uncertainty on the \(\elpd\) comes from the finite number of samples used to compute the expectation.

Since in current practice the \(\elpd\) is mostly used with Bayesian models, this is also what we do in Fig. 2.8. Efficient estimators exist for computing the elpd from training data, using either leave-one-out cross-validation [31] or the widely applicable information criterion (WAIC) [49]. This avoids the need to reserve a separate test dataset. We used ArviZ’s [50] implementation of WAIC for this comparison.
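Given an InferenceData object with a stored pointwise log likelihood, the WAIC estimate is a one-liner; here we use one of ArviZ's bundled example datasets rather than the models of this section:

```python
import arviz as az

idata = az.load_arviz_data("centered_eight")  # has a log_likelihood group
print(az.waic(idata))                         # elpd estimate with standard error
```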

In contrast to the \(\elpd\), the Akaike information criterion (AIC) does not assume a fixed model. Although it also ranks models according to their expected loss, this time the expectation is (conceptually) over both the test set \(\D_\test\) and training data \(\D_\train\), both drawn independently from the data-generating process [32, 33]:

\[\elpd^{\AIC}_A = \EE_{\D_\test \sim \Mtrue} \EE_{\D_\train \sim \Mtrue} \Bigl[ Q\bigl(\D_\test; \M_A\bigl(\hat{θ}_{\D_\train}\bigr)\bigr) \Bigr]\]

where \(\hat{θ}_{\D_\train}\) is the maximum likelihood estimate given the training data. Concretely however the \(\AIC\) is a simple statistic, similar to the BIC and other information criteria:

\[\AIC_A \coloneqq 2Q\bigl(\D_\train; \M_A\bigl(\hat{θ}_{\D_\train}\bigr)\bigr) + 2k_A \,,\]

where \(k_A\) is the number of parameters of model \(\M_A\). The derivation of the \(\AIC\) relies on \(Q\) being the log likelihood, and assumes that the likelihood is approximately Gaussian and non-singular around \(\hat{θ}_{\D_\train}\). When these conditions hold, \(Δ\AIC_{AB} := \AIC_A - \AIC_B\) is an unbiased estimator of the difference \(\elpd^{\AIC}_A - \elpd^{\AIC}_B\). In this case, the model with the lowest \(\AIC\) will – in expectation – also be the one with the lowest expected risk.
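In practice the \(\AIC\) requires only a maximum-likelihood fit; a generic helper (our naming, with `nll` the negative log likelihood on the training data) might read:

```python
import numpy as np
from scipy.optimize import minimize

def aic(nll, theta0, k, *data):
    """AIC = 2 * min_theta NLL(theta; data) + 2k,
    where k is the number of fitted parameters."""
    fit = minimize(nll, theta0, args=data)
    return 2 * fit.fun + 2 * k
```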

So how does the \(\Bemd{}\) compare to the criteria listed above? First, it computes \(R\)-distributions for fixed models. This is a key feature, since otherwise it is impossible to compare different parameter vectors for otherwise structurally identical models. (This is not a requirement though: EMD also allows models to be structurally different.) It also means that the \(\Bemd{}\) does not penalise model complexity, in contrast to the evidence, MDL and AIC.

Second, like the \(\elpd\), the \(\Bemd{}\) accounts for in-distribution variability by estimating the risk using held-out test data \(\D_\test\).

Third, EMD \(R\)-distributions also account for out-of-distribution variability. Conceptually this is done via the epistemic distributions \(Ω\), concretely via the HB process on PPFs – the variance of the latter stemming ultimately from the discrepancy function \(δ^{\EMD}\). Better models therefore result in less variability, reflecting our better grasp of the process being modelled. In contrast, only the MDL allows for some form of out-of-distribution variability (in the form of a complexity penalty), and it is essentially rigid: always an integral over the entire event space.

The biggest difference between the EMD \(R\)-distributions and other criteria however is their behaviour as the number of samples \(L\) increases. This we illustrate in the lower part of Fig. 2.8, which plots the value of a criterion for different models as a function of \(L\). We do this for three collections of models, where the data-generating process is always the Planck model \(\Bspec_\mathrm{P}\) described in the previous section. In all panels, scores are defined so that models with lower scores are preferred.

In the first row we compare four candidate models of the form

\[\Bspec \mid λ, T, σ, \{b_j\} \sim \nN \,\biggl(\Bspec_{\mathrm{P}}(λ; T) + \sum_{j=0}^m b_j λ^j,\; σ^2 \Bspec_{\mathrm{P}}(λ; T) \biggr) \,.\]

The candidates are therefore nested: any model within the class \(\M_{m=2}\) can be recovered from the class \(\M_{m=4}\) by fixing \(b_3=b_4=0\). Moreover, all models effectively contain \(\Mtrue\), since it corresponds to \(m=0\). (The data are in a regime where the Poisson distribution of \(\Mtrue\) is close to Gaussian.)

The second row is similar, but expands around the Rayleigh-Jeans model instead:

\[\Bspec \mid λ, T, σ, \{b_j\} \sim \nN \,\biggl(\Bspec_{\mathrm{RJ}}(λ; T) + \sum_{j=0}^m b_j λ^j,\; σ^2 \Bspec_{\mathrm{RJ}}(λ; T) \biggr) \,.\]

In this case all models are misspecified, but those with higher degree polynomials are better able to compensate for this and thus achieve lower prediction errors.

The third row compares the two \(m=0\) models from the first two rows.

If we interpret the score difference between two models (measured as the horizontal distance between two curves) as the strength of the evidence supporting a particular model choice, then it is clear from Fig. 2.8 that for all criteria except EMD, that strength grows without bound as we increase the number of samples \(L\). A bounded strength however is desirable when analyzing experimental data, since no experiment has infinite accuracy.

In contrast, we see that the \(R\)-distributions in the grey shaded area stabilize as a function of \(L\). The tail probabilities \(\Bemd{ab;c} = P(R_A < R_B \mid c)\) (Eq. 2.12), which define the EMD rejection rule, therefore also converge in general to finite, non-zero values. In some cases the \(R\)-distributions may approach a Dirac delta as \(L\) increases, as we see in the first row of Fig. 2.8. This occurs when the candidate model is able to match \(\Mtrue\) exactly.

We provide a different set of comparisons in the Supplementary Results, listing numerical values of criteria differences in Table 6.1 to complement the graphical presentation of Fig. 2.8. These alternative comparisons place greater emphasis on how misspecification can affect a criterion’s value, and how that value can (mis)represent the strength of the evidence.