4.6. The hierarchical beta process#
\(\require{mathtools} \newcommand{\notag}{} \newcommand{\tag}{} \newcommand{\label}[1]{} \newcommand{\sfrac}[2]{#1/#2} \newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\num}[1]{#1} \newcommand{\qty}[2]{#1\,#2} \renewenvironment{align} {\begin{aligned}} {\end{aligned}} \renewenvironment{alignat} {\begin{alignedat}} {\end{alignedat}} \newcommand{\pdfmspace}[1]{} % Ignore PDF-only spacing commands \newcommand{\htmlmspace}[1]{\mspace{#1}} % Ignore PDF-only spacing commands \newcommand{\scaleto}[2]{#1} % Allow to use scaleto from scalerel package \newcommand{\RR}{\mathbb R} \newcommand{\NN}{\mathbb N} \newcommand{\PP}{\mathbb P} \newcommand{\EE}{\mathbb E} \newcommand{\XX}{\mathbb X} \newcommand{\ZZ}{\mathbb Z} \newcommand{\QQ}{\mathbb Q} \newcommand{\fF}{\mathcal F} \newcommand{\dD}{\mathcal D} \newcommand{\lL}{\mathcal L} \newcommand{\gG}{\mathcal G} \newcommand{\hH}{\mathcal H} \newcommand{\nN}{\mathcal N} \newcommand{\pP}{\mathcal P} \newcommand{\BB}{\mathbb B} \newcommand{\Exp}{\operatorname{Exp}} \newcommand{\Binomial}{\operatorname{Binomial}} \newcommand{\Poisson}{\operatorname{Poisson}} \newcommand{\linop}{\mathcal{L}(\mathbb{B})} \newcommand{\linopell}{\mathcal{L}(\ell_1)} \DeclareMathOperator{\trace}{trace} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Span}{span} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\col}{col} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\gt}{>} \definecolor{highlight-blue}{RGB}{0,123,255} % definition, theorem, proposition \definecolor{highlight-yellow}{RGB}{255,193,7} % lemma, conjecture, example \definecolor{highlight-orange}{RGB}{253,126,20} % criterion, corollary, property \definecolor{highlight-red}{RGB}{220,53,69} % criterion \newcommand{\logL}{\ell} \newcommand{\eE}{\mathcal{E}} \newcommand{\oO}{\mathcal{O}} \newcommand{\defeq}{\stackrel{\mathrm{def}}{=}} \newcommand{\Bspec}{\mathcal{B}} % Spectral radiance \newcommand{\X}{\mathcal{X}} % X space \newcommand{\Y}{\mathcal{Y}} % Y space \newcommand{\M}{\mathcal{M}} % Model \newcommand{\Tspace}{\mathcal{T}} \newcommand{\Vspace}{\mathcal{V}} \newcommand{\Mtrue}{\mathcal{M}_{\mathrm{true}}} \newcommand{\MP}{\M_{\mathrm{P}}} \newcommand{\MRJ}{\M_{\mathrm{RJ}}} \newcommand{\qproc}{\mathfrak{Q}} \newcommand{\D}{\mathcal{D}} % Data (true or generic) \newcommand{\Dt}{\tilde{\mathcal{D}}} \newcommand{\Phit}{\widetilde{\Phi}} \newcommand{\Phis}{\Phi^*} \newcommand{\qt}{\tilde{q}} \newcommand{\qs}{q^*} \newcommand{\qh}{\hat{q}} \newcommand{\AB}[1]{\mathtt{AB}~\mathtt{#1}} \newcommand{\LP}[1]{\mathtt{LP}~\mathtt{#1}} \newcommand{\NML}{\mathrm{NML}} \newcommand{\iI}{\mathcal{I}} \newcommand{\true}{\mathrm{true}} \newcommand{\dist}{D} \newcommand{\Mtheo}[1]{\mathcal{M}_{#1}} % Model (theoretical model); index: param set \newcommand{\DL}[1][L]{\mathcal{D}^{(#1)}} % Data (RV or generic) \newcommand{\DLp}[1][L]{\mathcal{D}^{(#1')}} % Data (RV or generic) \newcommand{\DtL}[1][L]{\tilde{\mathcal{D}}^{(#1)}} % Data (RV or generic) \newcommand{\DpL}[1][L]{{\mathcal{D}'}^{(#1)}} % Data (RV or generic) \newcommand{\Dobs}[1][]{\mathcal{D}_{\mathrm{obs}}^{#1}} % Data (observed) \newcommand{\calibset}{\mathcal{C}} \newcommand{\N}{\mathcal{N}} % Normal distribution \newcommand{\Z}{\mathcal{Z}} % Partition function \newcommand{\VV}{\mathbb{V}} % Variance \newcommand{\T}{\mathsf{T}} % Transpose \newcommand{\EMD}{\mathrm{EMD}} \newcommand{\dEMD}{d_{\mathrm{EMD}}} \newcommand{\dEMDtilde}{\tilde{d}_{\mathrm{EMD}}} 
\newcommand{\dEMDsafe}{d_{\mathrm{EMD}}^{\text{(safe)}}} \newcommand{\e}{ε} % Model confusion threshold \newcommand{\falsifythreshold}{ε} \newcommand{\bayes}[1][]{B_{#1}} \newcommand{\bayesthresh}[1][]{B_{0}} \newcommand{\bayesm}[1][]{B^{\mathcal{M}}_{#1}} \newcommand{\bayesl}[1][]{B^l_{#1}} \newcommand{\bayesphys}[1][]{B^{{p}}_{#1}} \newcommand{\Bconf}[1]{B^{\mathrm{epis}}_{#1}} \newcommand{\Bemd}[1]{B^{\mathrm{EMD}}_{#1}} \newcommand{\Bconfbin}[1][]{\bar{B}^{\mathrm{conf}}_{#1}} \newcommand{\Bemdbin}[1][]{\bar{B}_{#1}^{\mathrm{EMD}}} \newcommand{\bin}{\mathcal{B}} \newcommand{\Bconft}[1][]{\tilde{B}^{\mathrm{conf}}_{#1}} \newcommand{\fc}{f_c} \newcommand{\fcbin}{\bar{f}_c} \newcommand{\paramphys}[1][]{Θ^{{p}}_{#1}} \newcommand{\paramobs}[1][]{Θ^{ε}_{#1}} \newcommand{\test}{\mathrm{test}} \newcommand{\train}{\mathrm{train}} \newcommand{\synth}{\mathrm{synth}} \newcommand{\rep}{\mathrm{rep}} \newcommand{\MNtrue}{\mathcal{M}^{{p}}_{\text{true}}} \newcommand{\MN}[1][]{\mathcal{M}^{{p}}_{#1}} \newcommand{\MNA}{\mathcal{M}^{{p}}_{Θ_A}} \newcommand{\MNB}{\mathcal{M}^{{p}}_{Θ_B}} \newcommand{\Me}[1][]{\mathcal{M}^ε_{#1}} \newcommand{\Metrue}{\mathcal{M}^ε_{\text{true}}} \newcommand{\Meobs}{\mathcal{M}^ε_{\text{obs}}} \newcommand{\Meh}[1][]{\hat{\mathcal{M}}^ε_{#1}} \newcommand{\MNa}{\mathcal{M}^{\mathcal{N}}_a} \newcommand{\MeA}{\mathcal{M}^ε_A} \newcommand{\MeB}{\mathcal{M}^ε_B} \newcommand{\Ms}{\mathcal{M}^*} \newcommand{\MsA}{\mathcal{M}^*_A} \newcommand{\MsB}{\mathcal{M}^*_B} \newcommand{\Msa}{\mathcal{M}^*_a} \newcommand{\MsAz}{\mathcal{M}^*_{A,z}} \newcommand{\MsBz}{\mathcal{M}^*_{B,z}} \newcommand{\Msaz}{\mathcal{M}^*_{a,z}} \newcommand{\MeAz}{\mathcal{M}^ε_{A,z}} \newcommand{\MeBz}{\mathcal{M}^ε_{B,z}} \newcommand{\Meaz}{\mathcal{M}^ε_{a,z}} \newcommand{\zo}{z^{0}} \renewcommand{\lL}[2][]{\mathcal{L}_{#1|{#2}}} % likelihood \newcommand{\Lavg}[2][]{\mathcal{L}^{/#2}_{#1}} % Geometric average of likelihood \newcommand{\lLphys}[2][]{\mathcal{L}^{{p}}_{#1|#2}} \newcommand{\Lavgphys}[2][]{\mathcal{L}^{{p}/#2}_{#1}} % Geometric average of likelihood \newcommand{\lLL}[3][]{\mathcal{L}^{(#3)}_{#1|#2}} \newcommand{\lLphysL}[3][]{\mathcal{L}^{{p},(#3)}_{#1|#2}} \newcommand{\lnL}[2][]{l_{#1|#2}} % Per-sample log likelihood \newcommand{\lnLt}[2][]{\widetilde{l}_{#1|#2}} \newcommand{\lnLtt}{\widetilde{l}} % Used only in path_sampling \newcommand{\lnLh}[1][]{\hat{l}_{#1}} \newcommand{\lnLphys}[2][]{l^{{p}}_{#1|#2}} \newcommand{\lnLphysL}[3][]{l^{{p},(#3)}_{#1|#2}} \newcommand{\Elmu}[2][1]{μ_{{#2}}^{(#1)}} \newcommand{\Elmuh}[2][1]{\hat{μ}_{{#2}}^{(#1)}} \newcommand{\Elsig}[2][1]{Σ_{{#2}}^{(#1)}} \newcommand{\Elsigh}[2][1]{\hat{Σ}_{{#2}}^{(#1)}} \newcommand{\pathP}{\mathop{{p}}} % Path-sampling process (generic) \newcommand{\pathPhb}{\mathop{{p}}_{\mathrm{Beta}}} % Path-sampling process (hierarchical beta) \newcommand{\interval}{\mathcal{I}} \newcommand{\Phiset}[1]{\{\Phi\}^{\small (#1)}} \newcommand{\Phipart}[1]{\{\mathcal{I}_Φ\}^{\small (#1)}} \newcommand{\qhset}[1]{\{\qh\}^{\small (#1)}} \newcommand{\Dqpart}[1]{\{Δ\qh_{2^{#1}}\}} \newcommand{\LsAzl}{\mathcal{L}_{\smash{{}^{\,*}_A},z,L}} \newcommand{\LsBzl}{\mathcal{L}_{\smash{{}^{\,*}_B},z,L}} \newcommand{\lsA}{l_{\smash{{}^{\,*}_A}}} \newcommand{\lsB}{l_{\smash{{}^{\,*}_B}}} \newcommand{\lsAz}{l_{\smash{{}^{\,*}_A},z}} \newcommand{\lsAzj}{l_{\smash{{}^{\,*}_A},z_j}} \newcommand{\lsAzo}{l_{\smash{{}^{\,*}_A},z^0}} \newcommand{\leAz}{l_{\smash{{}^{\,ε}_A},z}} \newcommand{\lsAez}{l_{\smash{{}^{*ε}_A},z}} \newcommand{\lsBz}{l_{\smash{{}^{\,*}_B},z}} 
\newcommand{\lsBzj}{l_{\smash{{}^{\,*}_B},z_j}} \newcommand{\lsBzo}{l_{\smash{{}^{\,*}_B},z^0}} \newcommand{\leBz}{l_{\smash{{}^{\,ε}_B},z}} \newcommand{\lsBez}{l_{\smash{{}^{*ε}_B},z}} \newcommand{\LaszL}{\mathcal{L}_{\smash{{}^{*}_a},z,L}} \newcommand{\lasz}{l_{\smash{{}^{*}_a},z}} \newcommand{\laszj}{l_{\smash{{}^{*}_a},z_j}} \newcommand{\laszo}{l_{\smash{{}^{*}_a},z^0}} \newcommand{\laez}{l_{\smash{{}^{ε}_a},z}} \newcommand{\lasez}{l_{\smash{{}^{*ε}_a},z}} \newcommand{\lhatasz}{\hat{l}_{\smash{{}^{*}_a},z}} \newcommand{\pasz}{p_{\smash{{}^{*}_a},z}} \newcommand{\paez}{p_{\smash{{}^{ε}_a},z}} \newcommand{\pasez}{p_{\smash{{}^{*ε}_a},z}} \newcommand{\phatsaz}{\hat{p}_{\smash{{}^{*}_a},z}} \newcommand{\phateaz}{\hat{p}_{\smash{{}^{ε}_a},z}} \newcommand{\phatseaz}{\hat{p}_{\smash{{}^{*ε}_a},z}} \newcommand{\Phil}[2][]{Φ_{#1|#2}} % Φ_{\la} \newcommand{\Philt}[2][]{\widetilde{Φ}_{#1|#2}} % Φ_{\la} \newcommand{\Philhat}[2][]{\hat{Φ}_{#1|#2}} % Φ_{\la} \newcommand{\Philsaz}{Φ_{\smash{{}^{*}_a},z}} % Φ_{\lasz} \newcommand{\Phileaz}{Φ_{\smash{{}^{ε}_a},z}} % Φ_{\laez} \newcommand{\Philseaz}{Φ_{\smash{{}^{*ε}_a},z}} % Φ_{\lasez} \newcommand{\mus}[1][1]{μ^{(#1)}_*} \newcommand{\musA}[1][1]{μ^{(#1)}_{\smash{{}^{\,*}_A}}} \newcommand{\SigsA}[1][1]{Σ^{(#1)}_{\smash{{}^{\,*}_A}}} \newcommand{\musB}[1][1]{μ^{(#1)}_{\smash{{}^{\,*}_B}}} \newcommand{\SigsB}[1][1]{Σ^{(#1)}_{\smash{{}^{\,*}_B}}} \newcommand{\musa}[1][1]{μ^{(#1)}_{\smash{{}^{*}_a}}} \newcommand{\Sigsa}[1][1]{Σ^{(#1)}_{\smash{{}^{*}_a}}} \newcommand{\Msah}{{\color{highlight-red}\mathcal{M}^{*}_a}} \newcommand{\Msazh}{{\color{highlight-red}\mathcal{M}^{*}_{a,z}}} \newcommand{\Meah}{{\color{highlight-blue}\mathcal{M}^{ε}_a}} \newcommand{\Meazh}{{\color{highlight-blue}\mathcal{M}^{ε}_{a,z}}} \newcommand{\lsazh}{{\color{highlight-red}l_{\smash{{}^{*}_a},z}}} \newcommand{\leazh}{{\color{highlight-blue}l_{\smash{{}^{ε}_a},z}}} \newcommand{\lseazh}{{\color{highlight-orange}l_{\smash{{}^{*ε}_a},z}}} \newcommand{\Philsazh}{{\color{highlight-red}Φ_{\smash{{}^{*}_a},z}}} % Φ_{\lasz} \newcommand{\Phileazh}{{\color{highlight-blue}Φ_{\smash{{}^{ε}_a},z}}} % Φ_{\laez} \newcommand{\Philseazh}{{\color{highlight-orange}Φ_{\smash{{}^{*ε}_a},z}}} % Φ_{\lasez} \newcommand{\emdstd}{\tilde{σ}} \DeclareMathOperator{\Mvar}{Mvar} \DeclareMathOperator{\AIC}{AIC} \DeclareMathOperator{\epll}{epll} \DeclareMathOperator{\elpd}{elpd} \DeclareMathOperator{\MDL}{MDL} \DeclareMathOperator{\comp}{COMP} \DeclareMathOperator{\Lognorm}{Lognorm} \DeclareMathOperator{\erf}{erf} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\Image}{Image} \DeclareMathOperator{\sgn}{sgn} \DeclareMathOperator{\SE}{SE} % standard error \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\SkewNormal}{SkewNormal} \DeclareMathOperator{\TruncNormal}{TruncNormal} \DeclareMathOperator{\Exponential}{Exponential} \DeclareMathOperator{\exGaussian}{exGaussian} \DeclareMathOperator{\IG}{IG} \DeclareMathOperator{\NIG}{NIG} \DeclareMathOperator{\Gammadist}{Gamma} \DeclareMathOperator{\Lognormal}{Lognormal} \DeclareMathOperator{\Beta}{Beta} \newcommand{\sinf}{{s_{\infty}}}\)
In this work we identify the epistemic uncertainty of a model \(\M_A\) with the variability of a stochastic process \(\qproc_A\): realizations[1] of \(\qproc_A\) approximate the PPF of the model loss. In our Results we listed desiderata which \(\qproc_A\) should satisfy, and proposed that it be described as a hierarchical beta (HB) process. However, we deferred giving a precise definition of \(\qproc_A\); we do so now in the form of Algorithm 4.1. The rest of this section explains the theoretical justification for each step of this generative algorithm for \(\qproc_A\).
Algorithm 4.1 (Hierarchical beta process)
- Given
\(\qs_A\), \(c\) and \(δ_A^{\EMD}\);
refinement level \(N\);
distribution for the end points: \(p\bigl(\qh(0), \qh(1)\bigr)\).
- Generate
a realization \(\qh_A\).
- Procedure:
1. Draw end points \(\bigl(\qh(0), \qh(1)\bigr) \sim p\bigl(\qh(0), \qh(1)\bigr)\); repeat until \(\qh(0) < \qh(1)\).
2. For each \(n \in 1, 2, \dotsc, N\): for each \(Φ \in \Phiset{n}\), compute \(r\) and \(v\) following Eq. 4.12, solve Eq. 4.13 for \(α\) and \(β\), then draw \(x_1 \sim \Beta(α, β)\) and use it to split the parent increment (§4.6.2), which fixes \(\qh_A(Φ)\).
3. Return \(\bigl(\qh_A(Φ) : Φ \in \Phipart{N}\bigr)\).
The quantities \(r\) and \(v\) conceptually represent the ratio between two successive increments and the variance of those increments.
4.6.1. Relevant concepts of Wiener processes#
Before introducing the HB process, let us first review a few key properties which stochastic processes must satisfy and which are covered in most standard introductions [67, 68, 69]. We do this using the well-known Wiener process, and also introduce notation that will become useful when we define the HB process. Since our goal is to define a process for PPFs, we use \(Φ\) to denote the independent “domain” variable and restrict ourselves to 1-d processes for which \(Φ \in [0, 1]\).
For the Wiener process \(\mathcal{W}\), each realization is a continuous function \(W\colon [0, 1] \to \RR\). One way to approximate a realization of \(\mathcal{W}\) is to first partition the interval into subintervals \([0, Φ_1),\, [Φ_1, Φ_2),\, \dotsc,\, [Φ_n, 1)\) with \(0 < Φ_1 < Φ_2 < \dotsb < Φ_n < 1\); for simplicity we will only consider equal-sized subintervals, so that \(Φ_k = k\,ΔΦ\) for some \(ΔΦ > 0\). We then generate a sequence of independent random increments (one for each subinterval) \(\bigl\{ΔW_{ΔΦ}(0),\, ΔW_{ΔΦ}(ΔΦ),\, ΔW_{ΔΦ}(2 ΔΦ),\, \dotsc,\, ΔW_{ΔΦ}(1-ΔΦ)\bigr\}\) and define the corresponding realization as

\[
W(k\,ΔΦ) = \sum_{j=0}^{k-1} ΔW_{ΔΦ}(j\,ΔΦ) \,, \qquad ΔW_{ΔΦ}(j\,ΔΦ) \sim \nN(0, ΔΦ) \,, \quad W(0) = 0 \,.
\]
(Within each interval the function may be linearly interpolated, so that \(W\) is continuous.)
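For concreteness, here is a minimal sketch of this construction (assuming NumPy; the number of subintervals and the random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 64                                   # number of equal-sized subintervals
dPhi = 1 / n
Phi = np.linspace(0, 1, n + 1)           # grid 0, ΔΦ, 2ΔΦ, ..., 1

# One independent increment ΔW_ΔΦ(kΔΦ) ~ N(0, ΔΦ) per subinterval
dW = rng.normal(0.0, np.sqrt(dPhi), size=n)

# W(0) = 0; W at each grid point is the accumulated sum of increments
W = np.concatenate(([0.0], np.cumsum(dW)))

# Linear interpolation gives a continuous W anywhere in [0, 1]
W_at_03 = np.interp(0.3, Phi, W)
```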
A refinement of a partition is obtained by taking each subinterval and further dividing it into smaller subintervals. For instance we can refine the unit interval \([0, 1)\) into a set of two subintervals, \([0, 2^{-1})\) and \([2^{-1}, 2^{-0})\); let us denote these partitions \(\Phipart{0}\) and \(\Phipart{1}\) respectively. Repeating the process on each subinterval yields a sequence of ever finer refinements:

\[
\Phipart{n} = \bigl\{\, [k\cdot 2^{-n},\; (k+1)\cdot 2^{-n}) \,\bigr\}_{k=0}^{2^n - 1} \,.
\]
With these definitions, for any \(m \geq n\), \(\Phipart{m}\) is a refinement of \(\Phipart{n}\).
Later we will need to refer to the vector of new end points introduced at the \(n\)-th refinement step. These are exactly the odd multiples of \(2^{-n}\) between 0 and 1, which we denote \(\Phiset{n}\):

\[
\Phiset{n} = \bigl( (2k+1)\cdot 2^{-n} \bigr)_{k=0}^{2^{n-1}-1} = \bigl( 2^{-n},\; 3\cdot 2^{-n},\; 5\cdot 2^{-n},\; \dotsc,\; 1 - 2^{-n} \bigr) \,.
\]
Visualizing refinement vectors \(\Phiset{n}\)
To see why the added points in Eq. 4.6 are exactly the odd multiples of \(2^{-n}\), it can be useful to align and sort the elements of the \(\Phiset{n}\) vectors:

\[
\begin{aligned}
\Phiset{1} &= \bigl(\tfrac{1}{2}\bigr) \,,\\
\Phiset{2} &= \bigl(\tfrac{1}{4},\, \tfrac{3}{4}\bigr) \,,\\
\Phiset{3} &= \bigl(\tfrac{1}{8},\, \tfrac{3}{8},\, \tfrac{5}{8},\, \tfrac{7}{8}\bigr) \,.
\end{aligned}
\]
The set of left boundary points for the subintervals in \(\Phipart{n}\) is the union of all refinements up to step \(n\),

\[
\{0\} \cup \Phiset{1} \cup \Phiset{2} \cup \dotsb \cup \Phiset{n} = \bigl\{ k\cdot 2^{-n} \bigr\}_{k=0}^{2^n - 1} \,,
\]

which are all multiples of \(2^{-n}\). The new points in \(\Phiset{n+1}\) must intercalate between each of these values.
Any random process must be self-consistent [40]: for small enough \(ΔΦ\), the probability distribution at a point \(Φ\) must not depend on the level of refinement. For example, the Wiener process is defined such that the increments \({ΔW_{ΔΦ} \sim \nN(0, ΔΦ)}\) are independent; therefore

\[
ΔW_{\tfrac{ΔΦ}{2}}(Φ) + ΔW_{\tfrac{ΔΦ}{2}}\bigl(Φ + \tfrac{ΔΦ}{2}\bigr) \;\sim\; \nN(0, ΔΦ) \,,
\]

which is exactly the distribution of the unrefined increment \(ΔW_{ΔΦ}(Φ)\).
It turns out that the combination of the Markovian and self-consistent properties sets quite strong requirements on the stochastic increments: they impose the square-root scaling of the Wiener increment, \(\oO(ΔW_{ΔΦ}) = \oO(\sqrt{ΔΦ})\) (§II.C of Gillespie [40]).
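This self-consistency is easy to check numerically. A small sketch (again assuming NumPy; the interval width and sample size are arbitrary) compares a coarse increment with the sum of two refined subincrements:

```python
import numpy as np

rng = np.random.default_rng(1)
dPhi, n_samples = 0.1, 200_000

# Coarse increment ΔW_ΔΦ ~ N(0, ΔΦ)
coarse = rng.normal(0.0, np.sqrt(dPhi), n_samples)

# Sum of two refined subincrements, each ~ N(0, ΔΦ/2)
fine = rng.normal(0.0, np.sqrt(dPhi / 2), (2, n_samples)).sum(axis=0)

# Both sample variances should approach ΔΦ = 0.1
print(coarse.var(), fine.var())
```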
While the Wiener process underlies much of stochastic theory, it is not suitable for defining a process \(\qproc\) over PPFs. Indeed, it is not monotone, which violates one of our desiderata. Moreover, it has a built-in directionality in the form of accumulated increments; a clear symptom of this is that the variance of \(W(Φ)\) grows as \(Φ\) increases. (This follows immediately from Eq. 4.4 and the independence of increments.) Directionality makes sense if we think of \(W\) as modelling the diffusion of particles in space or time, but empirical PPFs are obtained by first sorting data samples according to their loss (see the definition of \(δ^{\EMD}\) in the Results). Since samples of the loss arrive in no particular order, a process which samples PPFs should likewise have no intrinsic directionality in \(Φ\).
4.6.2. A hierarchical beta distribution is monotone, non-accumulating and self-consistent#
Constructing a stochastic process \(\qproc\) which is monotone is relatively simple: one only needs to ensure that the random increments \(Δ\qh_{ΔΦ}(Φ)\) are non-negative.
Ensuring that those increments are non-accumulating requires more care, because that requirement invalidates most common definitions of stochastic processes. As described in our Results, we achieve this by defining \(\qproc\) as a sequence of refinements: we start from a single increment for the entire interval, then double the number of increments (and halve their width) at each refinement step. In the rest of this subsection we give an explicit construction of this process and show that it is also self-consistent. (Although in this work we consider only pairs of increments sampled from a beta distribution, in general one could consider other compositional distributions. Higher-dimensional distributions may allow sampling all increments simultaneously, provided one can determine the conditions which ensure self-consistency.)
For an interval \(\interval = [Φ, Φ+ΔΦ)\), we suppose that the points \(\qh_A(Φ)\) and \(\qh_A(Φ + ΔΦ)\) are given. We define the increment

\[
Δ\qh_{ΔΦ}(Φ) \defeq \qh_A(Φ + ΔΦ) - \qh_A(Φ) \,;
\]
then we draw a subincrement \(Δ\qh_{\tfrac{ΔΦ}{2}}(Φ)\), associated to the subinterval \([Φ, Φ+\tfrac{ΔΦ}{2})\), from a scaled beta distribution:

\[
Δ\qh_{\tfrac{ΔΦ}{2}}(Φ) = x_1 \cdot Δ\qh_{ΔΦ}(Φ) \,, \qquad x_1 \sim \Beta(α, β) \,.
\]
(Refer to the subsection below for the definition of \(\Beta(α,β)\).) The scaling is chosen so that

\[
0 \;\leq\; Δ\qh_{\tfrac{ΔΦ}{2}}(Φ) \;\leq\; Δ\qh_{ΔΦ}(Φ) \,.
\]
The value of \(Δ\qh_{\tfrac{ΔΦ}{2}}(Φ)\) then determines the intermediate point:

\[
\qh_A\bigl(Φ + \tfrac{ΔΦ}{2}\bigr) = \qh_A(Φ) + Δ\qh_{\tfrac{ΔΦ}{2}}(Φ) \,.
\]
If desired, the complementary increment \(Δ\qh_{\tfrac{ΔΦ}{2}}\bigl(Φ+\tfrac{ΔΦ}{2}\bigr)\) can be obtained as

\[
Δ\qh_{\tfrac{ΔΦ}{2}}\bigl(Φ + \tfrac{ΔΦ}{2}\bigr) = Δ\qh_{ΔΦ}(Φ) - Δ\qh_{\tfrac{ΔΦ}{2}}(Φ) = x_2 \cdot Δ\qh_{ΔΦ}(Φ) \,, \qquad x_2 \defeq 1 - x_1 \,.
\]
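In code, a single splitting step might look like the following sketch (NumPy assumed; here `alpha` and `beta` are taken as given, since their selection is the subject of the next subsection):

```python
import numpy as np

def split_increment(q_low, q_high, alpha, beta, rng):
    """Split the increment over [Φ, Φ+ΔΦ) and return qh(Φ + ΔΦ/2).

    Draws x1 ~ Beta(α, β); the subincrement x1·Δqh is non-negative
    and no larger than Δqh, so monotonicity is preserved.
    """
    dq = q_high - q_low          # parent increment Δqh_ΔΦ(Φ)
    x1 = rng.beta(alpha, beta)   # x1 ∈ (0, 1)
    return q_low + x1 * dq       # intermediate point qh(Φ + ΔΦ/2)

rng = np.random.default_rng(2)
q_mid = split_increment(0.0, 1.0, 2.0, 2.0, rng)
```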
Generalizing the notation to the entire \([0,1)\) interval, we start from the sequence of increments associated to the subintervals at refinement step \(n\) (recall Eq. 4.5):

\[
\Dqpart{-n} \defeq \Bigl( Δ\qh_{{\small 2}^{-n}}\bigl(k\cdot {\small 2}^{-n}\bigr) \Bigr)_{k=0}^{2^n - 1} \,.
\]
Applying the procedure just described, for each subinterval we draw \(x_1\) and split the corresponding increment \(Δ\qh_{{\small 2}^{-n}}(Φ) \in \Dqpart{-n}\) into a pair \(\bigl(Δ\qh_{{\small 2}^{-n-1}}(Φ), Δ\qh_{{\small 2}^{-n-1}}(Φ+2^{-n-1})\bigr)\) of subincrements such that

\[
Δ\qh_{{\small 2}^{-n-1}}(Φ) + Δ\qh_{{\small 2}^{-n-1}}\bigl(Φ + 2^{-n-1}\bigr) = Δ\qh_{{\small 2}^{-n}}(Φ) \,.
\]
The union of the subincrements then constitutes the next refinement step:

\[
\Dqpart{-n-1} = \Bigl( Δ\qh_{{\small 2}^{-n-1}}\bigl(k\cdot {\small 2}^{-n-1}\bigr) \Bigr)_{k=0}^{2^{n+1} - 1} \,.
\]
After \(n\) refinement steps, we thus obtain a function \(\qh^{\small (n)}(Φ)\) defined at the discrete points

\[
Φ \in \bigl\{ k\cdot {\small 2}^{-n} \bigr\}_{k=0}^{2^n} \,,
\]
which we extend to the entire interval \([0, 1)\) by linear interpolation; see Fig. 2.4d for an illustration. In practice we found that computations (specifically the risk computed by integrating \(\qh^{\small (n)}(Φ)\)) converge after about eight refinement steps.
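Putting these pieces together, the refinement loop can be sketched as follows, with `split_increment` as above and a hypothetical `choose_alpha_beta(Phi, dPhi)` standing in for the parameter selection of the next subsection:

```python
import numpy as np

def hb_realization(q0, q1, N, choose_alpha_beta, rng):
    """Sample qh on the grid {k 2^{-N}} by N successive refinements."""
    q = {0.0: q0, 1.0: q1}                  # end points are given
    for n in range(N):
        dPhi = 2.0 ** (-n)                  # width of current subintervals
        for k in range(2 ** n):
            Phi = k * dPhi                  # left edge of the subinterval
            a, b = choose_alpha_beta(Phi, dPhi)
            q[Phi + dPhi / 2] = split_increment(
                q[Phi], q[Phi + dPhi], a, b, rng)
    Phis = np.array(sorted(q))
    return Phis, np.array([q[p] for p in Phis])
```

Because the grid points are dyadic rationals, they are represented exactly in floating point, so the dictionary lookups above are safe; `np.interp` then extends the realization to all of \([0, 1)\) by linear interpolation.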
This procedure has the important property that once a point is sampled, it does not change on further refinements:

\[
\qh^{\small (m)}(Φ) = \qh^{\small (n)}(Φ) \qquad \text{for all } m \geq n \text{ and } Φ \in \bigl\{ k\cdot {\small 2}^{-n} \bigr\}_{k=0}^{2^n} \,,
\]
which follows from Eq. 4.9. Recall now that, as stated above, a process is self-consistent if “for small enough \(ΔΦ\), the probability distribution at a point \(Φ\) [does] not depend on the level of refinement”. Since Eq. 4.10 clearly satisfies that requirement, we see that the process obtained after infinitely many refinement steps is indeed self-consistent. We thus define the hierarchical beta (HB) process \(\qproc_A\) as

\[
\qproc_A \defeq \lim_{n \to \infty} \qh_A^{\small (n)} \,,
\]

the process whose realizations are obtained in the limit of infinitely many refinement steps.
To complete the definition of \(\qproc_A\), we must specify how the initial end points \(\qh_A(0)\) and \(\qh_A(1)\) are chosen, as well as the beta parameters \(α\) and \(β\); the latter are the topic of the next subsection. In our code implementation, the end points are drawn from normal distributions \(\nN\bigl(\qs(Φ), \sqrt{c}\,δ_A^{\EMD}(Φ)^2\bigr)\) with \(Φ \in \{0, 1\}\), redrawing until \(\qh_A(0) < \qh_A(1)\); here again \(c\) is determined via our proposed calibration procedure. This choice is simple and convenient, but otherwise arbitrary.
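A sketch of this endpoint step (NumPy assumed; we read \(\nN(μ, σ^2)\) as parameterized by its variance, so the standard deviation below is \(c^{1/4}\,δ_A^{\EMD}(Φ)\); `q_star` and `delta_emd` are hypothetical callables for \(\qs\) and \(δ_A^{\EMD}\)):

```python
import numpy as np

def draw_end_points(q_star, delta_emd, c, rng):
    """Draw (qh(0), qh(1)), repeating until qh(0) < qh(1)."""
    while True:
        q0, q1 = (rng.normal(q_star(Phi),                  # mean  q*_A(Φ)
                             c**0.25 * delta_emd(Phi))     # std   (√c δ²)^(1/2)
                  for Phi in (0.0, 1.0))
        if q0 < q1:            # enforce monotonicity of the end points
            return q0, q1

rng = np.random.default_rng(3)
q0, q1 = draw_end_points(lambda P: P, lambda P: 0.1, c=1.0, rng=rng)
```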
4.6.3. Choosing beta distribution parameters#
All HB processes are monotone, continuous and self-consistent, but within this class there is still a lot of flexibility: since \(α\) and \(β\) are chosen independently for each subinterval, we can mold \(\qproc_A\) into a wide variety of statistical shapes. We use this flexibility to satisfy the two remaining desiderata: a) that \(\qh_A(Φ)\) realizations track \(\qs_A(Φ)\) over \(Φ \in [0, 1]\); and b) that the variability of \(\qh_A(Φ)\) be proportional to \(δ_A^{\EMD}(Φ)\). It is the goal of this subsection to give a precise mathematical meaning to those requirements.
Let \({x_1 \sim \Beta(α, β)}\) and \({x_2 = 1-x_1}\). (The density function of a beta distribution is given in Eq. 2.20.) The mean and variance of \(x_1\) are

\[
\EE[x_1] = \frac{α}{α+β} \,, \qquad \VV[x_1] = \frac{αβ}{(α+β)^2 (α+β+1)} \,.
\]
For a given \(Φ\), it may seem natural to select \(α\) and \(β\) by matching \(\EE[x_1]\) to \(\qs_A(Φ)\) and \(\VV[x_1]\) to \(c\!\cdot\!\bigl(δ_A^{\EMD}(Φ)\bigr)^2\). However, the two equations are tightly coupled, and we found the resulting numerical solutions unstable and unsatisfactory; in particular, the variance cannot be made large when \(\EE[x_1]\) approaches either 0 or 1, since the distribution of \(x_1\) would otherwise spill outside \([0, 1]\).
Much more practical is to consider moments with respect to the Aitchison measure; as mentioned in the Results, the Aitchison measure first maps the bounded interval \([0, 1]\) to the unbounded space \(\RR\) with a logistic transformation, then computes moments in the unbounded space. The first two such moments are called the centre and the metric variance [41, 42]; for the beta distribution, they are given by (reproduced from Eq. 2.21)

\[
\operatorname{cen}[x_1] = \frac{e^{ψ(α)}}{e^{ψ(α)} + e^{ψ(β)}} \,, \qquad
\Mvar[x_1] = \frac{ψ_1(α) + ψ_1(β)}{2} \,,
\]
where \(ψ\) and \(ψ_1\) are the digamma and trigamma functions respectively. The centre and metric variance are known to be more natural statistics for compositional distributions, and this is what we found in practice. Therefore we will relate \(\qs_A\) to the centre, and \(\sqrt{c} \, δ_A^{\EMD}\) to the square root of the metric variance.
To be precise, suppose that we have already selected a set of increments \(\Dqpart{-n}\) over the domain \(\{k\cdot {\small 2}^{-n}\}_{k=0}^{2^n-1}\), and wish to produce the refinement \(\Dqpart{-n-1}\). For each \(Φ\) in \(\{k\cdot {\small 2}^{-n}\}_{k=0}^{2^n-1}\) we define

\[
r \defeq \frac{Δ\qs_{{\small 2}^{-n-1}}(Φ)}{Δ\qs_{{\small 2}^{-n-1}}\bigl(Φ + 2^{-n-1}\bigr)} \,, \qquad
v \defeq c \cdot δ_A^{\EMD}\bigl(Φ + 2^{-n-1}\bigr)^2 \,.
\]
The value \(r\) is the ratio of subincrements of \(Δ\qs_{{\small 2}^{-n}}(Φ)\). Since we want \(\qh_A\) to track \(\qs_A\), it makes sense to expect \(r\) to also approximate the ratio of subincrements of \(Δ\qh_{2^{-n}}(Φ)\). Identifying

\[
\frac{\operatorname{cen}[x_1]}{\operatorname{cen}[x_2]} = r \qquad\text{and}\qquad \Mvar[x_1] = v
\]
and substituting into Eq. 2.21 then leads to the following system of equations:

\[
\begin{aligned}
ψ(α) - ψ(β) &= \ln r \,,\\
ψ_1(α) + ψ_1(β) &= 2 v \,,
\end{aligned}
\]
which we can solve to yield the desired parameters \(α\) and \(β\) for each subinterval in \(\Phipart{n}\).
In summary, although the justification is somewhat technical, the actual procedure for obtaining \(α\) and \(β\) is quite simple: first compute \(r\) and \(v\) following Eq. 4.12, then solve Eq. 4.13.
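As a concrete sketch (SciPy assumed), the system can be solved with standard special functions; the factor of 2 below follows the metric-variance convention written above and may differ under other conventions:

```python
import numpy as np
from scipy.optimize import root
from scipy.special import digamma, polygamma

def solve_alpha_beta(r, v):
    """Solve ψ(α) − ψ(β) = ln r, ψ₁(α) + ψ₁(β) = 2v for (α, β)."""
    def equations(x):
        a, b = np.exp(x)                   # log parameterization keeps α, β > 0
        return (digamma(a) - digamma(b) - np.log(r),
                polygamma(1, a) + polygamma(1, b) - 2 * v)
    sol = root(equations, x0=np.zeros(2))  # start from α = β = 1
    alpha, beta = np.exp(sol.x)
    return alpha, beta
```

For instance, `solve_alpha_beta(1.0, 0.5)` returns a symmetric pair \(α = β\), since \(\ln r = 0\).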