6. Supplementary Results#

\(\require{mathtools} \newcommand{\notag}{} \newcommand{\tag}{} \newcommand{\label}[1]{} \newcommand{\sfrac}[2]{#1/#2} \newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\num}[1]{#1} \newcommand{\qty}[2]{#1\,#2} \renewenvironment{align} {\begin{aligned}} {\end{aligned}} \renewenvironment{alignat} {\begin{alignedat}} {\end{alignedat}} \newcommand{\pdfmspace}[1]{} % Ignore PDF-only spacing commands \newcommand{\htmlmspace}[1]{\mspace{#1}} % Ignore PDF-only spacing commands \newcommand{\scaleto}[2]{#1} % Allow to use scaleto from scalerel package \newcommand{\RR}{\mathbb R} \newcommand{\NN}{\mathbb N} \newcommand{\PP}{\mathbb P} \newcommand{\EE}{\mathbb E} \newcommand{\XX}{\mathbb X} \newcommand{\ZZ}{\mathbb Z} \newcommand{\QQ}{\mathbb Q} \newcommand{\fF}{\mathcal F} \newcommand{\dD}{\mathcal D} \newcommand{\lL}{\mathcal L} \newcommand{\gG}{\mathcal G} \newcommand{\hH}{\mathcal H} \newcommand{\nN}{\mathcal N} \newcommand{\pP}{\mathcal P} \newcommand{\BB}{\mathbb B} \newcommand{\Exp}{\operatorname{Exp}} \newcommand{\Binomial}{\operatorname{Binomial}} \newcommand{\Poisson}{\operatorname{Poisson}} \newcommand{\linop}{\mathcal{L}(\mathbb{B})} \newcommand{\linopell}{\mathcal{L}(\ell_1)} \DeclareMathOperator{\trace}{trace} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Span}{span} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\col}{col} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\gt}{>} \definecolor{highlight-blue}{RGB}{0,123,255} % definition, theorem, proposition \definecolor{highlight-yellow}{RGB}{255,193,7} % lemma, conjecture, example \definecolor{highlight-orange}{RGB}{253,126,20} % criterion, corollary, property \definecolor{highlight-red}{RGB}{220,53,69} % criterion \newcommand{\logL}{\ell} \newcommand{\eE}{\mathcal{E}} \newcommand{\oO}{\mathcal{O}} \newcommand{\defeq}{\stackrel{\mathrm{def}}{=}} \newcommand{\Bspec}{\mathcal{B}} % Spectral radiance \newcommand{\X}{\mathcal{X}} % X space 
\newcommand{\Y}{\mathcal{Y}} % Y space \newcommand{\M}{\mathcal{M}} % Model \newcommand{\Tspace}{\mathcal{T}} \newcommand{\Vspace}{\mathcal{V}} \newcommand{\Mtrue}{\mathcal{M}_{\mathrm{true}}} \newcommand{\MP}{\M_{\mathrm{P}}} \newcommand{\MRJ}{\M_{\mathrm{RJ}}} \newcommand{\qproc}{\mathfrak{Q}} \newcommand{\D}{\mathcal{D}} % Data (true or generic) \newcommand{\Dt}{\tilde{\mathcal{D}}} \newcommand{\Phit}{\widetilde{\Phi}} \newcommand{\Phis}{\Phi^*} \newcommand{\qt}{\tilde{q}} \newcommand{\qs}{q^*} \newcommand{\qh}{\hat{q}} \newcommand{\AB}[1]{\mathtt{AB}~\mathtt{#1}} \newcommand{\LP}[1]{\mathtt{LP}~\mathtt{#1}} \newcommand{\NML}{\mathrm{NML}} \newcommand{\iI}{\mathcal{I}} \newcommand{\true}{\mathrm{true}} \newcommand{\dist}{D} \newcommand{\Mtheo}[1]{\mathcal{M}_{#1}} % Model (theoretical model); index: param set \newcommand{\DL}[1][L]{\mathcal{D}^{(#1)}} % Data (RV or generic) \newcommand{\DLp}[1][L]{\mathcal{D}^{(#1')}} % Data (RV or generic) \newcommand{\DtL}[1][L]{\tilde{\mathcal{D}}^{(#1)}} % Data (RV or generic) \newcommand{\DpL}[1][L]{{\mathcal{D}'}^{(#1)}} % Data (RV or generic) \newcommand{\Dobs}[1][]{\mathcal{D}_{\mathrm{obs}}^{#1}} % Data (observed) \newcommand{\calibset}{\mathcal{C}} \newcommand{\N}{\mathcal{N}} % Normal distribution \newcommand{\Z}{\mathcal{Z}} % Partition function \newcommand{\VV}{\mathbb{V}} % Variance \newcommand{\T}{\mathsf{T}} % Transpose \newcommand{\EMD}{\mathrm{EMD}} \newcommand{\dEMD}{d_{\mathrm{EMD}}} \newcommand{\dEMDtilde}{\tilde{d}_{\mathrm{EMD}}} \newcommand{\dEMDsafe}{d_{\mathrm{EMD}}^{\text{(safe)}}} \newcommand{\e}{ε} % Model confusion threshold \newcommand{\falsifythreshold}{ε} \newcommand{\bayes}[1][]{B_{#1}} \newcommand{\bayesthresh}[1][]{B_{0}} \newcommand{\bayesm}[1][]{B^{\mathcal{M}}_{#1}} \newcommand{\bayesl}[1][]{B^l_{#1}} \newcommand{\bayesphys}[1][]{B^{{p}}_{#1}} \newcommand{\Bconf}[1]{B^{\mathrm{epis}}_{#1}} \newcommand{\Bemd}[1]{B^{\mathrm{EMD}}_{#1}} \newcommand{\Bconfbin}[1][]{\bar{B}^{\mathrm{conf}}_{#1}} 
\newcommand{\Bemdbin}[1][]{\bar{B}_{#1}^{\mathrm{EMD}}} \newcommand{\bin}{\mathcal{B}} \newcommand{\Bconft}[1][]{\tilde{B}^{\mathrm{conf}}_{#1}} \newcommand{\fc}{f_c} \newcommand{\fcbin}{\bar{f}_c} \newcommand{\paramphys}[1][]{Θ^{{p}}_{#1}} \newcommand{\paramobs}[1][]{Θ^{ε}_{#1}} \newcommand{\test}{\mathrm{test}} \newcommand{\train}{\mathrm{train}} \newcommand{\synth}{\mathrm{synth}} \newcommand{\rep}{\mathrm{rep}} \newcommand{\MNtrue}{\mathcal{M}^{{p}}_{\text{true}}} \newcommand{\MN}[1][]{\mathcal{M}^{{p}}_{#1}} \newcommand{\MNA}{\mathcal{M}^{{p}}_{Θ_A}} \newcommand{\MNB}{\mathcal{M}^{{p}}_{Θ_B}} \newcommand{\Me}[1][]{\mathcal{M}^ε_{#1}} \newcommand{\Metrue}{\mathcal{M}^ε_{\text{true}}} \newcommand{\Meobs}{\mathcal{M}^ε_{\text{obs}}} \newcommand{\Meh}[1][]{\hat{\mathcal{M}}^ε_{#1}} \newcommand{\MNa}{\mathcal{M}^{\mathcal{N}}_a} \newcommand{\MeA}{\mathcal{M}^ε_A} \newcommand{\MeB}{\mathcal{M}^ε_B} \newcommand{\Ms}{\mathcal{M}^*} \newcommand{\MsA}{\mathcal{M}^*_A} \newcommand{\MsB}{\mathcal{M}^*_B} \newcommand{\Msa}{\mathcal{M}^*_a} \newcommand{\MsAz}{\mathcal{M}^*_{A,z}} \newcommand{\MsBz}{\mathcal{M}^*_{B,z}} \newcommand{\Msaz}{\mathcal{M}^*_{a,z}} \newcommand{\MeAz}{\mathcal{M}^ε_{A,z}} \newcommand{\MeBz}{\mathcal{M}^ε_{B,z}} \newcommand{\Meaz}{\mathcal{M}^ε_{a,z}} \newcommand{\zo}{z^{0}} \renewcommand{\lL}[2][]{\mathcal{L}_{#1|{#2}}} % likelihood \newcommand{\Lavg}[2][]{\mathcal{L}^{/#2}_{#1}} % Geometric average of likelihood \newcommand{\lLphys}[2][]{\mathcal{L}^{{p}}_{#1|#2}} \newcommand{\Lavgphys}[2][]{\mathcal{L}^{{p}/#2}_{#1}} % Geometric average of likelihood \newcommand{\lLL}[3][]{\mathcal{L}^{(#3)}_{#1|#2}} \newcommand{\lLphysL}[3][]{\mathcal{L}^{{p},(#3)}_{#1|#2}} \newcommand{\lnL}[2][]{l_{#1|#2}} % Per-sample log likelihood \newcommand{\lnLt}[2][]{\widetilde{l}_{#1|#2}} \newcommand{\lnLtt}{\widetilde{l}} % Used only in path_sampling \newcommand{\lnLh}[1][]{\hat{l}_{#1}} \newcommand{\lnLphys}[2][]{l^{{p}}_{#1|#2}} 
\newcommand{\lnLphysL}[3][]{l^{{p},(#3)}_{#1|#2}} \newcommand{\Elmu}[2][1]{μ_{{#2}}^{(#1)}} \newcommand{\Elmuh}[2][1]{\hat{μ}_{{#2}}^{(#1)}} \newcommand{\Elsig}[2][1]{Σ_{{#2}}^{(#1)}} \newcommand{\Elsigh}[2][1]{\hat{Σ}_{{#2}}^{(#1)}} \newcommand{\pathP}{\mathop{{p}}} % Path-sampling process (generic) \newcommand{\pathPhb}{\mathop{{p}}_{\mathrm{Beta}}} % Path-sampling process (hierarchical beta) \newcommand{\interval}{\mathcal{I}} \newcommand{\Phiset}[1]{\{\Phi\}^{\small (#1)}} \newcommand{\Phipart}[1]{\{\mathcal{I}_Φ\}^{\small (#1)}} \newcommand{\qhset}[1]{\{\qh\}^{\small (#1)}} \newcommand{\Dqpart}[1]{\{Δ\qh_{2^{#1}}\}} \newcommand{\LsAzl}{\mathcal{L}_{\smash{{}^{\,*}_A},z,L}} \newcommand{\LsBzl}{\mathcal{L}_{\smash{{}^{\,*}_B},z,L}} \newcommand{\lsA}{l_{\smash{{}^{\,*}_A}}} \newcommand{\lsB}{l_{\smash{{}^{\,*}_B}}} \newcommand{\lsAz}{l_{\smash{{}^{\,*}_A},z}} \newcommand{\lsAzj}{l_{\smash{{}^{\,*}_A},z_j}} \newcommand{\lsAzo}{l_{\smash{{}^{\,*}_A},z^0}} \newcommand{\leAz}{l_{\smash{{}^{\,ε}_A},z}} \newcommand{\lsAez}{l_{\smash{{}^{*ε}_A},z}} \newcommand{\lsBz}{l_{\smash{{}^{\,*}_B},z}} \newcommand{\lsBzj}{l_{\smash{{}^{\,*}_B},z_j}} \newcommand{\lsBzo}{l_{\smash{{}^{\,*}_B},z^0}} \newcommand{\leBz}{l_{\smash{{}^{\,ε}_B},z}} \newcommand{\lsBez}{l_{\smash{{}^{*ε}_B},z}} \newcommand{\LaszL}{\mathcal{L}_{\smash{{}^{*}_a},z,L}} \newcommand{\lasz}{l_{\smash{{}^{*}_a},z}} \newcommand{\laszj}{l_{\smash{{}^{*}_a},z_j}} \newcommand{\laszo}{l_{\smash{{}^{*}_a},z^0}} \newcommand{\laez}{l_{\smash{{}^{ε}_a},z}} \newcommand{\lasez}{l_{\smash{{}^{*ε}_a},z}} \newcommand{\lhatasz}{\hat{l}_{\smash{{}^{*}_a},z}} \newcommand{\pasz}{p_{\smash{{}^{*}_a},z}} \newcommand{\paez}{p_{\smash{{}^{ε}_a},z}} \newcommand{\pasez}{p_{\smash{{}^{*ε}_a},z}} \newcommand{\phatsaz}{\hat{p}_{\smash{{}^{*}_a},z}} \newcommand{\phateaz}{\hat{p}_{\smash{{}^{ε}_a},z}} \newcommand{\phatseaz}{\hat{p}_{\smash{{}^{*ε}_a},z}} \newcommand{\Phil}[2][]{Φ_{#1|#2}} % Φ_{\la} 
\newcommand{\Philt}[2][]{\widetilde{Φ}_{#1|#2}} % Φ_{\la} \newcommand{\Philhat}[2][]{\hat{Φ}_{#1|#2}} % Φ_{\la} \newcommand{\Philsaz}{Φ_{\smash{{}^{*}_a},z}} % Φ_{\lasz} \newcommand{\Phileaz}{Φ_{\smash{{}^{ε}_a},z}} % Φ_{\laez} \newcommand{\Philseaz}{Φ_{\smash{{}^{*ε}_a},z}} % Φ_{\lasez} \newcommand{\mus}[1][1]{μ^{(#1)}_*} \newcommand{\musA}[1][1]{μ^{(#1)}_{\smash{{}^{\,*}_A}}} \newcommand{\SigsA}[1][1]{Σ^{(#1)}_{\smash{{}^{\,*}_A}}} \newcommand{\musB}[1][1]{μ^{(#1)}_{\smash{{}^{\,*}_B}}} \newcommand{\SigsB}[1][1]{Σ^{(#1)}_{\smash{{}^{\,*}_B}}} \newcommand{\musa}[1][1]{μ^{(#1)}_{\smash{{}^{*}_a}}} \newcommand{\Sigsa}[1][1]{Σ^{(#1)}_{\smash{{}^{*}_a}}} \newcommand{\Msah}{{\color{highlight-red}\mathcal{M}^{*}_a}} \newcommand{\Msazh}{{\color{highlight-red}\mathcal{M}^{*}_{a,z}}} \newcommand{\Meah}{{\color{highlight-blue}\mathcal{M}^{ε}_a}} \newcommand{\Meazh}{{\color{highlight-blue}\mathcal{M}^{ε}_{a,z}}} \newcommand{\lsazh}{{\color{highlight-red}l_{\smash{{}^{*}_a},z}}} \newcommand{\leazh}{{\color{highlight-blue}l_{\smash{{}^{ε}_a},z}}} \newcommand{\lseazh}{{\color{highlight-orange}l_{\smash{{}^{*ε}_a},z}}} \newcommand{\Philsazh}{{\color{highlight-red}Φ_{\smash{{}^{*}_a},z}}} % Φ_{\lasz} \newcommand{\Phileazh}{{\color{highlight-blue}Φ_{\smash{{}^{ε}_a},z}}} % Φ_{\laez} \newcommand{\Philseazh}{{\color{highlight-orange}Φ_{\smash{{}^{*ε}_a},z}}} % Φ_{\lasez} \newcommand{\emdstd}{\tilde{σ}} \DeclareMathOperator{\Mvar}{Mvar} \DeclareMathOperator{\AIC}{AIC} \DeclareMathOperator{\epll}{epll} \DeclareMathOperator{\elpd}{elpd} \DeclareMathOperator{\MDL}{MDL} \DeclareMathOperator{\comp}{COMP} \DeclareMathOperator{\Lognorm}{Lognorm} \DeclareMathOperator{\erf}{erf} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\Image}{Image} \DeclareMathOperator{\sgn}{sgn} \DeclareMathOperator{\SE}{SE} % standard error \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\SkewNormal}{SkewNormal} 
\DeclareMathOperator{\TruncNormal}{TruncNormal} \DeclareMathOperator{\Exponential}{Exponential} \DeclareMathOperator{\exGaussian}{exGaussian} \DeclareMathOperator{\IG}{IG} \DeclareMathOperator{\NIG}{NIG} \DeclareMathOperator{\Gammadist}{Gamma} \DeclareMathOperator{\Lognormal}{Lognormal} \DeclareMathOperator{\Beta}{Beta} \newcommand{\sinf}{{s_{\infty}}}\)

6.1. Additional calibration curves for the neuron model#

In Calibrating and validating the BEMD, we considered only the effect of varying the external input strength (\(σ_i\)) on the calibration curves for the Prinz model. Fig. 6.1 is an extended version of Fig. 2.5, where all 48 proposed epistemic distributions are plotted.

[Figure: prinz_calibrations_all_combined.svg — calibration curves for all 48 epistemic distributions]

Fig. 6.1 Additional calibration curves for the neuron model, computed following the procedure described in section Calibrating and validating the BEMD of the main text. As in Fig. 2.5, each curve summarizes 512 simulated experiments with datasets of size 4000. The regions depicted in red and yellow are those where Eq. 2.29 is violated.#

6.2. Comparison of selection criteria with inconclusive data#

A key motivation underlying the \(\Bemd{}\) was to define a selection criterion which provides reliable estimates of its own uncertainty, in particular for large test datasets (thousands of samples or more): experiments commonly produce datasets of this size, yet such datasets cause pathologies in many of the common statistical methods of model selection. In short, a criterion can both over- and underestimate the strength of the evidence in favour of one model.

We illustrate this phenomenon in Table 6.1, with a set of comparisons meant to complement those of Fig. 2.8 in the main text and which explore the effect of misspecification in greater detail. We use 18 different variations of data generated with Eq. 2.33: three levels of noise \(s\) (ranging from high, \(2^{12}\), to low, \(2^{20}\) [\({\mathrm{m}^{2} \cdot \mathrm{nm} \cdot \mathrm{photons} \cdot \mathrm{sr}} / {\mathrm{kW}}\)]), three dataset sizes \(L\) (\(2^{9}\), \(2^{12}\), and \(2^{15}\)) and two wavelength ranges (20–1000 \(\mathrm{µm}\) and 15–30 \(\mathrm{µm}\)). Datasets with high noise and long wavelengths provide almost no discriminatory information: the two model predictions are almost the same, and any discrepancy between them is dwarfed by the amount of noise. At the other end, the low-noise, short-wavelength datasets are at the threshold between moderate and strong evidence in favour of \(\MP\) – in this regime the better model can already be identified by eye.

Whereas Fig. 2.8 in the main text compares criteria on their own terms, on a type of problem they were designed to tackle, Table 6.1 explores what happens in a more challenging situation, with misspecified models and an ambiguous selection problem. As with Fig. 2.8, all comparisons are done on the basis of a single test dataset, so replication uncertainty does not play a role.

In addition to the \(\Bemd{}\) and the risk (\(B^R\)), Table 6.1 includes criteria based on the likelihood ratio (\(B^l\)), the Bayes factor (\(B^{\mathrm{Bayes}}\)) [30, 45] and the expected log predictive density (elpd) (\(B^{\mathrm{elpd}}\)) [30, 31]. We note that because \(\MP\) and \(\MRJ\) have the same number of parameters, the likelihood ratio is also equivalent to the Akaike Information Criterion (AIC), up to a factor of 2 [30].

To make the criteria comparable, for each one we report the log probability ratio, to be read as “model \(A\) is \(B^C_{AB}\) times more probable than model \(B\).” Conceptually, if \(P(A)\) is the “probability of model \(A\)” and \(P(B)\) the “probability of model \(B\)”, then a criterion corresponds to

(6.1)#\[B^C_{AB} = \frac{P(A)}{P(B)} \quad \leftrightarrow \quad \log_{10} B^C_{AB} = \log_{10} P(A) - \log_{10} P(B) \,.\]

This is the typical form for Bayes factors, and we use Jeffreys’ scale [45, 72] to interpret the values in Table 6.1: values near \(0\), \(\pm 1\) and \(\pm 2\) respectively correspond to inconclusive, weak, and strong evidence.
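As a quick reference, the coarse bands of Jeffreys’ scale used above can be encoded in a small helper. A minimal sketch — the function name and the cutoffs between bands are our own choices for illustration, not taken from the source:

```python
def jeffreys_label(log10_ratio: float) -> str:
    """Map a log10 probability ratio to a rough Jeffreys-scale label.

    Coarse reading of the bands used in the text: |log10 B| near 0 is
    inconclusive, near 1 is weak, near 2 or beyond is strong evidence.
    The cutoff points (0.5 and 1.5) are illustrative choices.
    """
    m = abs(log10_ratio)
    if m < 0.5:
        return "inconclusive"
    elif m < 1.5:
        return "weak"
    else:
        return "strong"
```

For example, a criterion value of \(+1\) (model \(A\) ten times more probable) counts as weak evidence on this coarse scale, while \(-2.3\) counts as strong evidence against \(A\).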

Although only Bayes factors are typically reported in this form, most information criteria are defined in terms of log probabilities; the difference between the criteria of two models (the sign of which determines which model is selected) can be identified with the difference \(\log_{10} P(A) - \log_{10} P(B)\) in Eq. 6.1.

The main exception is the EMD criterion, which is defined in terms of a probability \(P(R_A < R_B)\) rather than separate probabilities for \(A\) and \(B\). To express it as a probability ratio, we define the “underbar” quantity, obtained by applying a logit transformation to \(\Bemd{AB;c}\):

(6.2)#\[\log_{10} \underline{B}^{\EMD}_{AB;c} \coloneqq \log_{10} {\Bemd{AB;c}} - \log_{10} \bigl(1 - \Bemd{AB;c}\bigr) \,.\]

Values of \(\underline{B}^{\EMD}_{AB;c}\) are therefore not directly comparable with the values of \(\Bemd{AB;c}\) reported elsewhere in this paper. In Table 6.1, \(\log_{10} \underline{B}^{\EMD}_{AB;c}\) ranges from \(-\infty\) to \(\infty\), with values of \(\pm \infty\) indicating that the two \(R\)-distributions have zero overlap.
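The logit transform of Eq. 6.2 is a one-liner; a minimal sketch (function name hypothetical), with the boundary cases \(\Bemd{AB;c} \in \{0, 1\}\) mapped explicitly to \(\mp\infty\) to reflect non-overlapping \(R\)-distributions:

```python
import math

def log10_underbar_Bemd(Bemd: float) -> float:
    """Log-odds (base 10) of the EMD criterion, Eq. 6.2.

    Bemd is a probability P(R_A < R_B) in [0, 1]; exact values of
    0 or 1 map to -inf / +inf, signalling that the two R-distributions
    have zero overlap.
    """
    if Bemd <= 0.0:
        return float("-inf")
    if Bemd >= 1.0:
        return float("inf")
    return math.log10(Bemd) - math.log10(1.0 - Bemd)
```

A value of \(\Bemd{} = 0.5\) (no preference) maps to \(0\), and \(\Bemd{} = 10/11\) maps to \(+1\), i.e. odds of ten to one.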

For comparison, we also define \(\underline{B}^R\), to show what happens if we simply compare the true risks:

(6.3)#\[\log_{10} \underline{B}^R_{AB} \coloneqq (-R_{A} + R_{B}) \;/\; \log 10 \,.\]

The division by \(\log 10\) converts the logarithm from base \(e\) to base \(10\).
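Eq. 6.3 can be sketched directly (helper name hypothetical); since risks are on a natural-log scale, a risk difference of \(\log 10 \approx 2.303\) in favour of \(A\) corresponds to one decade of probability ratio:

```python
import math

def log10_underbar_BR(R_A: float, R_B: float) -> float:
    """Eq. 6.3: difference of (natural-log-scale) risks, divided by
    log 10 to convert the ratio to base 10."""
    return (-R_A + R_B) / math.log(10.0)
```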

To interpret the results summarised in Table 6.1, we highlight two types of undesirable behaviour:

Lack of saturation

There should always come a point where enough data have been collected, and simply enlarging the dataset with samples from the same distribution provides no further information. In the table, therefore, we want values to converge to a finite value as \(L\) increases.

Non-robustness

The magnitude of a criterion should be an indicator of its robustness. If small changes in a dataset cause a criterion to flip from strongly negative to strongly positive, this suggests that its magnitude is not a good indication of the strength of the evidence.

If we are to interpret the values of criteria as relative probabilities, the global pattern we would expect to see in Table 6.1 is that cells become progressively more blue as we go to the right (less noise, less model mismatch) and as we go down (more data, less ambiguous data).

Comparison of different model selection criteria for variations of the datasets shown in \cref{fig_uv-example_r-distributions}. Criteria compare the Planck model (\(\MP\)) against the Rayleigh-Jeans model (\(\MRJ\)) and are evaluated for different dataset sizes (\(L\)), different levels of noise (\(s\)) and different wavelength windows (\(λ\)). To allow for comparisons, all criteria have been transformed into ratios of probabilities. Values reported are the \(\log_{10}\) of those ratios. Positive (blue shaded) values indicate evidence in favour of the Planck model, while negative (red shaded) values indicate the converse. For example, a value of +1 is interpreted as \(\MP\) being 10 times more likely than \(\MRJ\), while a value of -2 would suggest that \(\MP\) is 100 times less likely than \(\MRJ\). Noise levels are reported in two ways: \(s\) is the actual value used to generate the data, while “rel. \(σ\)” reports the resulting standard deviation as a fraction of the maximum radiance within the data window. The 20–1000 \(\mathrm{\mu m}\) window for \(λ\) stretches into the far infrared range, where the two models are nearly indistinguishable; hence higher positive values are expected for zero bias, low noise, and the 15–30 \(\mathrm{\mu m}\) window. As in \cref{fig_uv-example_r-distributions}, calculations were done for both positive and null bias conditions (resp. \(\mathcal{B}_0 > 0\) and \(\mathcal{B}_0 = 0\)); the former emulates a situation where neither model can fit the data perfectly. Expressions for all criteria are given in the supplementary text. For the \(\underline{B}^{\mathrm{EMD}}_{\mathrm{P,RJ}}\) criteria, we used \(c=0.5\). Although this is the same value as we used for the neuron model, it was determined through a separate calibration with different epistemic distributions. [source]

The left three value columns correspond to \(\mathcal{B}_0 > 0\) (rel. \(σ\): 35%, 9%, 2%), the right three to \(\mathcal{B}_0 = 0\) (rel. \(σ\): 36%, 9%, 2%); column headers give the noise level \(s\).

| Criterion | \(λ\) (µm) | \(L\) | \(s=2^{12}\) | \(s=2^{16}\) | \(s=2^{20}\) | \(s=2^{12}\) | \(s=2^{16}\) | \(s=2^{20}\) |
|---|---|---|---|---|---|---|---|---|
| \(B^{l}_{\mathrm{P,RJ}}\) | 20–1000 | \(2^{9}\) | -0.72 | -0.74 | 5.64 | -0.72 | -0.82 | 4.62 |
| | | \(2^{12}\) | 0.64 | -0.41 | 14.49 | 0.62 | -1.05 | 7.94 |
| | | \(2^{15}\) | -3.02 | 10.33 | 40.41 | -3.23 | 8.11 | -8.50 |
| | 15–30 | \(2^{9}\) | 0.06 | -0.35 | 3.27 | 0.06 | -0.39 | 2.61 |
| | | \(2^{12}\) | 0.52 | 3.86 | 52.20 | 0.49 | 3.53 | 47.12 |
| | | \(2^{15}\) | 3.04 | 23.02 | 389.62 | 2.87 | 20.33 | 348.85 |
| \(B^{\mathrm{Bayes}}_{\mathrm{P,RJ};π}\) | 20–1000 | \(2^{9}\) | 0.02 | -0.01 | 0.02 | -0.06 | 0.04 | -6×10⁻³ |
| | | \(2^{12}\) | 0.06 | -0.03 | -0.02 | 0.04 | -0.03 | 0.07 |
| | | \(2^{15}\) | -0.05 | -0.05 | 0.04 | 0.03 | -0.03 | -0.02 |
| | 15–30 | \(2^{9}\) | -6×10⁻³ | -0.03 | 0.01 | -0.02 | -7×10⁻³ | -0.05 |
| | | \(2^{12}\) | -4×10⁻³ | -5×10⁻³ | 0.04 | -0.04 | -0.06 | -0.01 |
| | | \(2^{15}\) | 0.02 | 0.06 | -0.05 | 9×10⁻³ | 0.05 | -0.01 |
| \(B^{\mathrm{elpd}}_{\mathrm{P,RJ};π}\) | 20–1000 | \(2^{9}\) | -4×10⁻³ | 3×10⁻⁵ | 3×10⁻⁴ | 2×10⁻⁴ | 2×10⁻³ | 5×10⁻⁴ |
| | | \(2^{12}\) | -9×10⁻⁴ | 1×10⁻³ | -4×10⁻³ | -3×10⁻³ | -3×10⁻³ | 7×10⁻⁴ |
| | | \(2^{15}\) | 2×10⁻⁴ | 2×10⁻³ | -2×10⁻³ | 1×10⁻⁵ | -7×10⁻⁵ | -2×10⁻³ |
| | 15–30 | \(2^{9}\) | -4×10⁻³ | -2×10⁻³ | -2×10⁻³ | 6×10⁻⁴ | 2×10⁻³ | -4×10⁻⁴ |
| | | \(2^{12}\) | -3×10⁻³ | 1×10⁻³ | 6×10⁻⁴ | 6×10⁻⁴ | -8×10⁻⁴ | -2×10⁻³ |
| | | \(2^{15}\) | 6×10⁻⁴ | -2×10⁻⁴ | -1×10⁻³ | -2×10⁻³ | -2×10⁻³ | -1×10⁻³ |
| \(\underline{B}^{R}_{\mathrm{P,RJ}}\) | 20–1000 | \(2^{9}\) | 2×10⁻⁵ | -5×10⁻⁴ | 5×10⁻³ | 2×10⁻⁵ | -7×10⁻⁴ | 4×10⁻³ |
| | | \(2^{12}\) | 2×10⁻⁵ | -5×10⁻⁴ | 5×10⁻³ | 2×10⁻⁵ | -7×10⁻⁴ | 4×10⁻³ |
| | | \(2^{15}\) | 2×10⁻⁵ | -5×10⁻⁴ | 5×10⁻³ | 2×10⁻⁵ | -7×10⁻⁴ | 4×10⁻³ |
| | 15–30 | \(2^{9}\) | 8×10⁻⁵ | 6×10⁻⁴ | 0.01 | 8×10⁻⁵ | 6×10⁻⁴ | 9×10⁻³ |
| | | \(2^{12}\) | 8×10⁻⁵ | 6×10⁻⁴ | 0.01 | 8×10⁻⁵ | 6×10⁻⁴ | 9×10⁻³ |
| | | \(2^{15}\) | 8×10⁻⁵ | 6×10⁻⁴ | 0.01 | 8×10⁻⁵ | 6×10⁻⁴ | 9×10⁻³ |
| \(\underline{B}^{\mathrm{EMD}}_{\mathrm{P,RJ}}\) | 20–1000 | \(2^{9}\) | -0.04 | 0.11 | 0.33 | -0.07 | 0.12 | 0.41 |
| | | \(2^{12}\) | -0.01 | 0.07 | 0.34 | -0.01 | 0.07 | 0.39 |
| | | \(2^{15}\) | -1×10⁻³ | -0.04 | 0.26 | -1×10⁻² | -0.05 | 0.29 |
| | 15–30 | \(2^{9}\) | -0.01 | 0.31 | 3.08 | 0.05 | 0.34 | 3.50 |
| | | \(2^{12}\) | -0.03 | 0.40 | 1.87 | -0.02 | 0.39 | 1.80 |
| | | \(2^{15}\) | -0.05 | 0.24 | 1.74 | -0.04 | 0.25 | 1.66 |

6.2.1. Expressions used to compute selection criteria#

Conventions:

  • \(\D\): Training dataset used to fit the model.

  • \(L\): Number of data samples in \(\D\).

  • \((λ_i, \Bspec_i)\): Sample in \(\D\).

  • \((λ_j', \Bspec_j')\): Additional sample not in \(\D\) (i.e. a test sample).

  • \(\hat{σ}, \hat{T}\): Estimates of \(σ\) and \(T\) obtained by maximizing the likelihood on \(\D\).

EMD criterion
(6.4)#\[\begin{equation*} \log_{10} \underline{B}^{\mathrm{EMD}}_{AB;c} \coloneqq \log_{10} {\Bemd{AB;c}} - \log_{10} \bigl(1 - \Bemd{AB;c}\bigr) \,, \end{equation*}\]

where \(A = \M_{\mathrm{P}}\) and \(B = \M_{\mathrm{RJ}}\). (Repeated from Eq. 6.2.)

Loss function

As in Eq. 2.5, we define the loss of a point \((λ_i, \Bspec_i)\) as the negative log likelihood of that data point. Since the candidate models assume Gaussian noise (Eq. 2.34), this is simply

(6.5)#\[\begin{split}\begin{aligned} Q_a({\Bspec}_i\mid λ_i,σ,T) &= -\log p\Bigl({\Bspec}_i \,\Bigm|\, {\Bspec}_{a}(λ_i; T), σ\Bigr) \notag\\ &= \log\! \sqrt{2π}σ + \frac{({\Bspec}_i-{\Bspec}_{a}(λ_i; T))^2}{2σ^2} \,. \end{aligned}\end{split}\]

Here the subscript \(a\) is used to indicate whether we use predictions from the Rayleigh-Jeans (Eq. 2.31) or Planck (Eq. 2.32) model.

Log likelihood function

Since we defined the loss as the negative log likelihood, one way to express the log likelihood for the whole dataset is simply as the negative total loss:

\[\logL_a(σ, T) \coloneqq \sum_{i=1}^L -Q_{a}(\Bspec_i \mid λ_i, σ, T) \,.\]
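The loss of Eq. 6.5 and the resulting dataset log likelihood can be sketched as follows, with a generic callable `B_model(lam, T)` standing in for either the Rayleigh-Jeans or Planck prediction (all names are hypothetical, for illustration only):

```python
import math

def Q(B, lam, sigma, T, B_model):
    """Pointwise loss of Eq. 6.5: negative Gaussian log likelihood of the
    sample (lam, B) given the model prediction B_model(lam, T)."""
    resid = B - B_model(lam, T)
    return math.log(math.sqrt(2.0 * math.pi) * sigma) + resid**2 / (2.0 * sigma**2)

def log_likelihood(data, sigma, T, B_model):
    """Dataset log likelihood, expressed as the negative total loss
    over the samples (lam_i, B_i) in `data`."""
    return -sum(Q(B, lam, sigma, T, B_model) for lam, B in data)
```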
Risk

The risk is the expectation of the loss under the true model, so we have

\[R_a \coloneqq \Bigl\langle Q_{a}\Bigl(\Bspec_j' \mid λ_j', \hat{σ}_{a}, \hat{T}_a\Bigr) \Bigr\rangle_{λ_j', \Bspec_j' \sim \Mtrue} \,,\]
(6.6)#\[\log_{10} \underline{B}^R \coloneqq \bigl(-R_{\mathrm{P}} + R_{\mathrm{RJ}}\bigr) \;/\; \log 10 \,,\]

where \({λ_j', \Bspec_j' \sim \Mtrue}\) indicates that \(λ_j'\) and \(\Bspec_j'\) are random variables with the same distribution as the data. (Repeated from Eq. 6.3.)

Since in this case we know \(\Mtrue\), the expectation can be computed exactly. We did this by generating a very large number of data samples (\(L=2^{12}\)) and computing the empirical average of the loss:

\[R_a \approx -\frac{1}{L} \logL_a(\hat{σ}_a, \hat{T}_a) \,.\]
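This empirical average can be sketched generically; `empirical_risk` and its arguments are hypothetical names, with `sample_true` standing in for draws from \(\Mtrue\):

```python
import math
import random

def empirical_risk(loss, sample_true, n=4096, seed=0):
    """Monte Carlo estimate of the risk R_a: average of the loss over
    samples (lam, B) drawn from the (known) true model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        lam, B = sample_true(rng)
        total += loss(B, lam)
    return total / n
```

As a sanity check, for a Gaussian loss with \(σ = 1\) and data drawn from the matching unit normal, the estimate converges to \(\tfrac{1}{2}\log 2π + \tfrac{1}{2} \approx 1.419\).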
Bayesian prior

The Bayesian calculations require a prior on the parameters \(σ\) and \(T\). We used a simple prior which factorizes into two independent distributions:

\[\begin{split}\begin{aligned} π\Bigl(\log_2 σ\Bigr) &\sim \Unif\Bigl(\Bigl[\log_2 2^9, \log_2 2^{14}\Bigr]\Bigr) \,, \\ π\Bigl(\log_2 T\Bigr) &\sim \Unif\Bigl(\Bigl[\log_2 1000, \log_2 5000\Bigr]\Bigr) \,. \end{aligned}\end{split}\]

Here \(T\) is expressed in Kelvin and \(σ\) has units \({\mathrm{kW}} / {\left(\mathrm{m}^{2} \cdot \mathrm{nm} \cdot \mathrm{sr}\right)}\). We chose log uniform distributions because these are more appropriate for parameters which are strictly positive and which can span multiple scales: the logarithmic scaling captures the fact that the difference between 1000 K and 1001 K is more significant than the difference between 5000 K and 5001 K. Likewise for differences in the parameter \(σ\) at opposite ends of its range.
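A draw from this factorized log-uniform prior can be sketched as (helper name hypothetical):

```python
import math
import random

def sample_prior(rng):
    """Draw (sigma, T) from the factorized log-uniform prior:
    log2(sigma) ~ Unif([9, 14]), log2(T) ~ Unif([log2 1000, log2 5000])."""
    log2_sigma = rng.uniform(9.0, 14.0)
    log2_T = rng.uniform(math.log2(1000.0), math.log2(5000.0))
    return 2.0 ** log2_sigma, 2.0 ** log2_T
```

Sampling uniformly in \(\log_2\) space and exponentiating is what makes equal multiplicative changes equally likely across the parameter range.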

Expected log pointwise posterior predictive density (elpd)
(6.7)#\[\mathrm{elpd}_a \coloneqq \Bigl\langle \log_{10} p\bigl(λ_j', \Bspec_j' \mid a, \D\bigr) \Bigr\rangle_{λ_j', \Bspec_j' \sim \Mtrue}\]
\[\log_{10} B^{\mathrm{elpd}} \coloneqq \mathrm{elpd}_{\mathrm{P}} - \mathrm{elpd}_{\mathrm{RJ}} \,.\]

Note that \(p\) in Eq. 6.7 is a posterior density, and therefore evaluating it involves integrating over the prior:

\[p\bigl(λ_j', \Bspec_j' \mid a, \D\bigr) = \iint \! dσ dT \, π_σ(σ) π_T(T) \; p\bigl(λ_j', \Bspec_j' \mid a, \D, σ, T\bigr) \,.\]
Relative likelihood and AIC

The relative likelihood is given by

(6.8)#\[\log_{10} B^l \coloneqq \bigl(\logL_{\mathrm{P}}(\hat{σ}_{\mathrm{P}}, \hat{T}_{\mathrm{P}}) - \logL_{\mathrm{RJ}}(\hat{σ}_{\mathrm{RJ}}, \hat{T}_{\mathrm{RJ}})\bigr) \;/\; \log 10\,,\]

while the difference between the AIC criteria of both models is (we use here the fact that both models have the same number of parameters)

\[Δ\AIC \coloneqq 2\logL_{\mathrm{P}}(\hat{σ}_{\mathrm{P}}, \hat{T}_{\mathrm{P}}) - 2\logL_{\mathrm{RJ}}(\hat{σ}_{\mathrm{RJ}}, \hat{T}_{\mathrm{RJ}}) \,.\]

Since the two are equivalent up to a factor of \(2 \log 10\), the trends we see with the likelihood ratio also occur with the AIC.

(The factor 2 in the AIC is meant to allow it to be interpreted – under certain assumptions – as a draw from a \(χ^2\) distribution. This correspondence however is not required when interpreting Table 6.1.)
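The stated equivalence is just a rescaling; a minimal sketch (function name hypothetical):

```python
import math

def delta_AIC_from_log10_Bl(log10_Bl):
    """Convert the log10 likelihood ratio (Eq. 6.8) into the AIC
    difference used in the text: since both models have the same number
    of parameters, the two differ only by a factor 2 ln 10."""
    return 2.0 * math.log(10.0) * log10_Bl
```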

Model evidence

The model evidence is used to compute the Bayes factors. It is the expectation of the likelihood of the data – \(\D = \{λ_i, \Bspec_i\}_{i=1}^L\) – under the prior for \(T\) and \(σ\):

\[\begin{split}\begin{aligned} \eE_a &\coloneqq \iint \! dσ dT \, π_σ(σ) π_T(T) \; p\Bigl(\{λ_i, \Bspec_i\}_{i=1}^L \mid a, σ, T\Bigr) \notag\\ &\vphantom{:}= \iint \! dσ dT \, π_σ(σ) π_T(T) \, e^{\logL_a(σ,T)} \,. \end{aligned}\end{split}\]

The likelihood \(p\Bigl(\{λ_i, \Bspec_i\}_{i=1}^L \mid a, σ, T\Bigr)\) is given by the Gaussian observation model.
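Numerically, the evidence integral is usually evaluated in log space, since the likelihood \(e^{\logL_a}\) underflows for large \(L\). A minimal grid-quadrature sketch with a log-sum-exp (all names hypothetical; not the authors' implementation):

```python
import math

def log_evidence(loglik, sigma_grid, T_grid, prior_sigma, prior_T):
    """Approximate log E_a = log of the double integral of
    prior_sigma(s) * prior_T(T) * exp(loglik(s, T)) on a uniform
    rectangular grid, using log-sum-exp for numerical stability."""
    ds = sigma_grid[1] - sigma_grid[0]   # grid spacings (uniform grids assumed)
    dT = T_grid[1] - T_grid[0]
    terms = [math.log(prior_sigma(s) * prior_T(T)) + loglik(s, T)
             for s in sigma_grid for T in T_grid]
    m = max(terms)                       # shift by the max before exponentiating
    return m + math.log(sum(math.exp(t - m) for t in terms)) + math.log(ds * dT)
```

With flat unit priors over the unit square and a constant log likelihood \(c\), the result is exactly \(c\), which makes a convenient sanity check.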

Bayes factor
\[\log_{10} B^B \coloneqq \log_{10} \eE_{\mathrm{P}} - \log_{10} \eE_{\mathrm{RJ}} \,.\]