Probabilistic Deep Learning - Adnan Harun Dogan

Probabilistic Foundations for Deep Learning

Thesis. Probabilistic inference cannot be separated from the learning algorithm and its scalability constraints. Classical graphical models (HMMs, Kalman filters, LDA) solve inference exactly, but only for special structures. When neural networks are used as model components, structural constraints dissolve and inference becomes intractable. That intractability is not a side-effect to be engineered away: it is the organizing pressure that forces variational objectives (the ELBO), approximate inference families (amortized encoders, score matching), and ultimately changes what "inference" even means. The field did not merely apply neural nets to probabilistic models; intractability reshaped the objectives and the notion of inference itself.

The four families below follow the arc that pressure created: exact-but-structured (classical graphical models); general-but-intractable (neural model components); approximation-as-necessity (ELBO, variational families, score matching); and new objectives and meanings (amortized inference, implicit models, denoising).

Exact Inference under Structural Constraints

The Bayesian formalism was systematized for data analysis by (Gelman et al. 2013 Ch 1-2), building on a philosophical foundation laid out by (Jaynes 2003) that frames probability as extended logic rather than as long-run frequency. The contemporary pedagogical synthesis is (Murphy 2022 Ch 4-5), which unifies frequentist maximum-likelihood, MAP estimation, and fully Bayesian inference within a single probabilistic-graphical-model vocabulary. Earlier textbook treatments include (Bishop 2006) and (MacKay 2003); the latter is notable for weaving Bayesian inference, information theory, and coding theory into a single narrative.

Structured probabilistic modeling was codified in (Koller and Friedman 2009), which remains the canonical reference for directed and undirected graphical models, exact inference (variable elimination, junction trees), approximate inference (loopy belief propagation, MCMC), and learning. The dominant classical latent-variable models preceding deep learning are hidden Markov models, systematized for speech recognition by (Rabiner 1989); linear-Gaussian state-space models, introduced by (Kalman 1960) for linear filtering and control; and latent Dirichlet allocation, developed for document topic modeling by (Blei, Ng, and Jordan 2003). Discriminative structured prediction is represented by conditional random fields (Lafferty, McCallum, and Pereira 2001), which sidestep the locally-normalized exposure-bias issue of maximum-entropy Markov models.

What these models share is a structural assumption that makes the posterior integral tractable: the chain topology for HMMs and Kalman filters (enabling forward-backward and Kalman recursion), and Dirichlet-Multinomial conjugacy for LDA (enabling closed-form coordinate-ascent VI). The inference algorithm and the model structure are inseparable. This inseparability is the constraint that the neural-network era dissolves, and that dissolution is what forces the rest of the arc.

Neural Generality and the Emergence of Intractability

The bridge from classical Bayesian statistics to neural networks was built in the early 1990s. (MacKay 1992) introduced the Laplace-approximation evidence framework, showing that weight-decay regularization has a principled Bayesian interpretation and that hyperparameters such as decay constants can themselves be optimized by maximizing the marginal likelihood. (Neal 1996) extended this to full-posterior inference via Hamiltonian Monte Carlo and, in the same monograph, proved that a one-hidden-layer network with Gaussian weights converges to a Gaussian process in the infinite-width limit. This result anchors the modern NNGP/NTK literature and is revisited in the Bayesian-deep-learning part of this book.

The problem is visible immediately in MacKay's and Neal's settings: the posterior over neural-network weights is no longer Gaussian, the Hessian is no longer positive-definite by construction, and the marginal likelihood integral $\int p(\mathcal{D} \mid \theta) p(\theta) d\theta$ has no closed form for nonlinear activation functions. MacKay used the Laplace approximation as a stand-in; Neal used HMC on networks with at most a few hundred parameters. Neither scales to modern architectures. Generality was purchased at the exact price of tractability, and what followed was a decades-long search for the right currency with which to pay that price.

Deep variants of the classical graphical-model archetypes (deep belief networks (Hinton, Osindero, and Teh 2006), deep Boltzmann machines (Salakhutdinov and Hinton 2009), and structured-prediction energy networks, SPENs, (Belanger and McCallum 2016)) encountered the same barrier. Replacing the compatibility function of a CRF with a neural energy (SPEN) makes exact partition-function computation intractable; the SPEN performs inference by gradient descent on the energy instead. Replacing the Gaussian emissions of an RBM with deeper nonlinear layers (DBM) destroys the bipartite block-Gibbs tractability and forces mean-field variational approximations for the positive phase. In each case the neural component was the source of generality and the source of intractability simultaneously.

Approximation as Necessity: ELBO, Variational Families, Score Matching

The unifying response to intractability was the evidence lower bound. (Dempster, Laird, and Rubin 1977) unified a scattered collection of iterative maximum-likelihood procedures (the Baum-Welch algorithm for hidden Markov models, Lindley and Smith's variance-component estimation, Hartley's incomplete-data MLE) under a single expectation-maximization (EM) scheme and proved monotone ascent of the incomplete-data log-likelihood. (Neal and Hinton 1998) subsequently recast EM as coordinate ascent on a free-energy functional:

\begin{equation} \label{eq:rw-elbo} F(q, \theta) = \mathbb{E}_{q(z)}[\log p(x, z \mid \theta)] + \mathbb{H}[q] = \log p(x \mid \theta) - \mathrm{KL}(q(z) \,\|\, p(z \mid x, \theta)), \end{equation}

which is tight when $q$ equals the true posterior and is a lower bound on the log-evidence otherwise. This free-energy lens subsumes wake-sleep (Hinton et al. 1995), variational Bayes (Attias 1999), mean-field VI for graphical models (Jordan et al. 1999), and the review of modern variational methods in (Blei, Kucukelbir, and McAuliffe 2017). The ELBO is not merely one of many inference techniques; it is the universal surrogate that becomes necessary precisely when the posterior integral in Bayes' theorem has no closed form.

A parallel response to the partition-function problem in undirected models is score matching. (Hyvärinen 2005) proposed minimizing the expected squared difference between the model score $\nabla_x \log p_\theta(x)$ and the data score, thereby bypassing the partition function entirely:

\begin{equation} \label{eq:rw-score-matching} J(\theta) = \mathbb{E}_{p_{\mathrm{data}}}\!\left[\tfrac{1}{2}\|\nabla_x \log p_\theta(x)\|^2 + \mathrm{tr}(\nabla_x^2 \log p_\theta(x))\right]. \end{equation}

The minimizer of $J(\theta)$ equals the true model score under the data distribution whenever $p_\theta$ and the data density share support, with no requirement to compute or approximate $Z(\theta)$ . Score matching is the direct conceptual ancestor of denoising score matching and modern diffusion generative models, which train a neural score network $s_\theta(x) \approx \nabla_x \log p(x)$ by denoising rather than by evaluating the partition-function gradient.

New Objectives and New Meanings of Inference

The ELBO and score matching do not merely approximate classical inference; they change its meaning. In the amortized variational inference (VAE) framework of (Kingma and Welling 2014), the E-step of EM is replaced by a neural encoder $q_\phi(z \mid x)$ that outputs variational parameters in a single forward pass, shared across all data points. Inference is no longer per-datum optimization; it is a function learned jointly with the model. The meaning of "posterior" shifts from "the exact posterior under this model" to "the variational approximation that maximizes the ELBO under this encoder family." These are not the same object, and the amortization gap (the residual optimality gap between the amortized $q_\phi(z \mid x)$ and the per-datum optimal $q^\star(z)$ ) is a concrete measure of how far the new definition departs from the old one.

Score-based implicit models make the departure more radical. When a model is defined only through its score function $s_\theta(x) = \nabla_x \log p_\theta(x)$ , there is no explicit density, no partition function, and no marginal likelihood to lower-bound. The concept of "inference" in this setting refers to sampling via Langevin dynamics $x_{t+1} = x_t + \epsilon s_\theta(x_t) + \sqrt{2\epsilon}\, \xi_t$ rather than computing a posterior. The model family, the inference procedure, and the training objective are inseparable in a way that has no counterpart in the classical graphical-model view.

The competitive landscape, summarized below, shows that by the mid-2010s purely discriminative deep classifiers surpassed generative graphical-model approaches on standard benchmarks, though the latter retained advantages in small-data, interpretable, or structured-output settings that motivate the renewed interest in probabilistic deep learning covered in the remainder of this book.

Inference Tractability vs Model Generality: The Landscape

Inference tractability versus model generality for the probabilistic model families covered in this book. The horizontal axis runs from structurally constrained models (conjugate priors, chain-topology graphical models) toward architecturally unconstrained neural models; the vertical axis runs from exact closed-form posteriors downward through deterministic approximations to stochastic and implicit inference. The green region (top-left) holds models whose exact posteriors follow from exponential-family conjugacy or from dynamic-programming recursions on restricted graph topologies: conjugate Bayesian updates, the Kalman filter, the HMM forward-backward algorithm, and LDA's Dirichlet-Multinomial mean-field coordinate ascent. The blue band (middle) holds models where the posterior is approximated deterministically: MacKay's Laplace BNN replaces the intractable Gaussian weight posterior with a second-order Taylor expansion; the DBM uses mean-field plus persistent contrastive divergence; Neal's HMC BNN is exact up to Monte Carlo error but requires O(parameter-count) gradient evaluations per sample. The orange region (bottom-right) holds models where inference is inherently approximate or implicit: the VAE replaces the E-step with an amortized encoder network, normalizing flows define inference through an invertible change of variables, and score-based / diffusion models define the model only through its score function and sample via Langevin dynamics, bypassing the partition function entirely. The two diagonal arrows mark the historical pressure that drove each transition: neural generality dissolved the structural constraints that made exact inference tractable, and the resulting intractability forced the field to develop new objectives (the ELBO, score matching) and new meanings of inference (amortized, implicit).

State-of-the-Art Table: Inference Mechanisms Across Model Families

The following table organises the model families along the arc above, making explicit for each family whether inference is exact or approximate, what approximation is used when inference is intractable, and how scalability is achieved. Citations outside cells per convention.

Probabilistic model families ordered along the exact-to-approximate arc. "Inference" reports whether the posterior or partition function is tractable. "Approximation used" names the surrogate when inference is intractable. "Scalability" describes the practical scaling bottleneck or the technique that overcomes it. Citations: HMM (Rabiner 1989); Kalman (Kalman 1960); LDA (Blei, Ng, and Jordan 2003); CRF (Lafferty, McCallum, and Pereira 2001); MacKay BNN (MacKay 1992); Neal HMC BNN (Neal 1996); DBN (Hinton, Osindero, and Teh 2006); DBM (Salakhutdinov and Hinton 2009); SPEN (Belanger and McCallum 2016); VAE (Kingma and Welling 2014); score matching (Hyvärinen 2005).
Model family	Inference (exact/approx.)	Approximation used	Scalability	Arc position
HMM	Exact (forward-backward)	None needed	$O(TK^2)$ ; scales to hundreds of states	Exact, chain-structured
Kalman / linear-Gaussian	Exact (Gaussian filter)	None needed	$O(Td_z^3)$ ; diagonal/low-rank variants	Exact, linear-Gaussian
LDA (mean-field VB)	Approx. (variational)	Mean-field ELBO; closed-form coord. asc.	Stochastic VI scales to 100M docs	Approx., conjugate structure
Linear-chain CRF	Exact marginals ( $Z(x)$ )	None for marginals; approx. for decoding	$O(TK^2)$ forward-backward	Exact, discriminative
MacKay Laplace BNN	Approx. (Laplace)	Second-order Taylor at MAP; GGN Hessian	KFAC / last-layer approximation	Approx., tractable at small scale
Neal HMC BNN	Stochastic ( $\leq$ MC err.)	HMC trajectory; leapfrog integrator	Intractable beyond $\sim$ 1k params	Stochastic, exact in limit
DBM + mean-field	Approx. (mean-field + PCD)	Factored Bernoulli MF; persistent chain	Requires PCD; fragile training	Approx. undirected
SPEN	Approx. (gradient descent)	Gradient descent on $y$ ; SSVM loss	Scales in label-space dim., not seq. len	Approx., neural energy
VAE (amortized VI)	Approx. (ELBO + encoder)	Amortized $q_\phi(z \mid x)$ ; reparam.	Scales to large datasets / dim. latents	Amortized, new meaning
Score matching / EBM	Implicit (no density)	Score matching / NCE; Langevin sampling	Scales via denoising (diffusion)	Implicit, partition-free

The table makes four structural points. First, exact inference is available only when graph topology (chain for HMM/Kalman/CRF) or algebraic structure (Gaussian-Gaussian, Dirichlet-Multinomial) provides closed-form posteriors; introducing a single neural component into any of these models destroys closure. Second, the Laplace approximation (MacKay) and mean-field VI (LDA, DBM) are the two classical deterministic surrogates, differing in whether the posterior is approximated by a Gaussian at the mode or by a factorized product. Third, amortized inference (VAE) shifts the computational burden from per-datum optimization to an encoder training cost, paid once and amortized across all data. Fourth, score matching and diffusion models bypass the posterior concept entirely: inference is sampling, not density evaluation, and the training objective does not require $\log Z(\theta)$ at any step. Each row represents a response to the limitation of the row above it.

Empirical Benchmark: Classical and Deep Models on MNIST

The inference-mechanism table above is qualitative; the quantitative companion below records where each family actually landed on a common benchmark (MNIST) alongside its native task. It carries the empirical fact that drove the renewed interest in probabilistic deep learning: by the mid-2010s purely discriminative deep networks had overtaken generative graphical-model approaches on raw test error, while the generative models kept their edge in small-data, interpretable, and structured-output regimes.

Classical-to-deep probabilistic models on MNIST and their native tasks. "MNIST err." is classification test error where the model was used discriminatively; "Gen. NLL" is generative negative log-likelihood in nats as given in the source; "n/r" marks a benchmark not reported in the primary reference because the model was not evaluated on it. Citations: HMM (Rabiner 1989); Kalman (Kalman 1960); LDA (Blei, Ng, and Jordan 2003); CRF (Lafferty, McCallum, and Pereira 2001); DBN (Hinton, Osindero, and Teh 2006); DBM (Salakhutdinov and Hinton 2009); SPEN (Belanger and McCallum 2016); MLP and CNN (Murphy 2022).
Model family	Year	Native task	MNIST err.	Gen. NLL (nats)
HMM (discrete-state, Gaussian obs.)	1989	Speech phoneme seq.	n/r	n/r
Kalman filter (linear-Gaussian SSM)	1960	Control, tracking	n/r	n/r
LDA (100 topics, batch VB)	2003	Text topic discovery	n/r	n/r
Linear-chain CRF	2001	POS tagging, NER	n/r	n/r
Deep Belief Net (500-500-2000)	2006	Gen. digit model	1.25 %	~86
Deep Boltzmann Machine (500-1000)	2009	Gen. digit model	0.95 %	~85
SPEN (deep energy, gradient inference)	2016	Multi-label / seq.	n/r	n/r
Feed-forward MLP (ReLU, dropout)	2012	Discriminative	0.79 %	n/r
CNN (LeNet-5 style)	1998	Discriminative	0.95 %	n/r

This benchmark makes the point the inference-mechanism table cannot. The deep generative graphical models (DBN, DBM) reach competitive MNIST classification once their learned features are fine-tuned with a discriminative head (1.25 % and 0.95 %), yet they are edged out on raw test error by a purely discriminative network (the MLP at 0.79 %). The CRF-to-SPEN and HMM-to-DBM transitions are the same move in two settings: replace a hand-designed compatibility or energy function with a neural network, keeping the probabilistic semantics while gaining representational capacity. The classical latent-variable models (HMM, Kalman, LDA, CRF) carry "n/r" because they target distinct problem shapes (time series, topic discovery, sequence labeling) and were never evaluated as MNIST classifiers.

The remainder of this chapter develops each row in order, following the historical sequence rather than the table ordering, culminating in the deep variants that motivate the rest of the book.

Bayesian Neural Networks

Summary. Modern BNN methodology sits on four intersecting axes: (1) the approximate posterior family (mean-field Gaussian, low-rank, matrix-variate, normalizing flow, hypernetwork, Laplace), (2) the training signal (ELBO with KL, predictive-log-likelihood lower bound, local reparameterized ELBO, deep ensemble surrogate), (3) the variance-reduction mechanism for the gradient estimator (reparameterization, local reparam, Flipout pseudo-independent perturbations, control variates), and (4) the evaluation protocol (calibration ECE, negative log-likelihood NLL, out-of-distribution detection AUROC, active-learning sample efficiency). The methods surveyed here populate this grid in characteristic ways.

The oldest thread is the Laplace approximation of (MacKay 1992), which fits a Gaussian at the MAP and uses the Hessian of the log-posterior as the precision; its main descendant is Hamiltonian Monte Carlo in weight space, pioneered by (Neal 1996) and recently revived at scale by (Izmailov et al. 2021). Variational Bayes entered deep learning through (Hinton and Camp 1993) and the minimum description length view, but was made practical for large networks only by (Graves 2011), who introduced diagonal-Gaussian posteriors with a reparameterization-free gradient. (Blundell et al. 2015) closed the remaining gap by formally using the reparameterization trick of (Kingma and Welling 2014) (concurrently (Rezende, Mohamed, and Wierstra 2014)) to obtain unbiased low-variance gradients through a Gaussian posterior, calling the resulting algorithm Bayes by Backprop. (Kingma, Salimans, and Welling 2015) then showed that pushing the reparameterization from weight space into pre-activation space (the local trick) reduces gradient variance by a factor proportional to the minibatch size. (Wen et al. 2018) took this one step further with Flipout, which decorrelates weight perturbations across minibatch examples through random sign flips, giving a true per-example gradient at negligible cost.

A parallel development came from (Gal and Ghahramani 2016b), who reinterpreted dropout regularization as variational inference with a spike-and-slab posterior, making every network trained with dropout retroactively Bayesian; the RNN extension is (Gal and Ghahramani 2016a) and the continuous relaxation is (Gal, Hron, and Kendall 2017). (Osband 2016) raised the fundamental objection that MC Dropout conflates aleatoric risk (the data distribution $p(y\mid x)$ variance) with epistemic uncertainty (posterior concentration over $\theta$ ), and that the MC-Dropout predictive variance does not shrink with more data as a true posterior should.

Flow-based posteriors extend the variational family beyond Gaussian: (Rezende and Mohamed 2015) introduced planar and radial flows for latent-variable VI; (Louizos and Welling 2017) adapted this to weight-space by multiplicative normalizing flows (MNF), where a flow transforms an auxiliary random variable that scales Gaussian weight samples, yielding a richer, multimodal implicit posterior. Complementary threads include deep ensembles (Lakshminarayanan, Pritzel, and Blundell 2017), SWAG (Maddox et al. 2019), rank-one BNNs (Dusenberry et al. 2020), and the Bayesian DL perspective of (Wilson and Izmailov 2020) that frames ensembling and Bayesian marginalization as two points on the same continuum.

State-of-the-art comparison table

Uncertainty-quantification methods for BNNs compared on image classification (MNIST LeNet5, CIFAR10 ResNet20) and UCI regression (avg over 10 datasets). NLL = negative log-likelihood (nats, lower better). ECE = expected calibration error (%, lower better). Accuracy in parentheses. OOD-AUROC is CIFAR10 vs SVHN. Numbers synthesized from the cited papers at their original settings; reproductions vary.
	MNIST	CIFAR10
Method	NLL / ECE	NLL / ECE (Acc)	UCI reg. NLL	OOD AUROC
MAP (SGD, no UQ baseline)	0.032 / 1.4%	0.31 / 4.2% (91.2)	2.83	0.87
Laplace at MAP (MacKay 1992)	0.030 / 1.1%	0.29 / 3.5% (91.0)	2.76	0.90
HMC weight-space (Neal 1996)	0.026 / 0.6%	0.24 / 1.9% (89.7)	2.58	0.93
Graves VBNN (Graves 2011)	0.029 / 1.3%	0.33 / 3.9% (90.1)	2.71	0.89
Bayes by Backprop (Blundell et al. 2015)	0.028 / 1.2%	0.31 / 3.6% (89.9)	2.63	0.90
Local reparam / var-dropout (Kingma, Salimans, and Welling 2015)	0.027 / 1.0%	0.29 / 3.1% (90.4)	2.62	0.91
Flipout (Wen et al. 2018)	0.026 / 0.9%	0.28 / 2.8% (90.6)	2.60	0.91
MC Dropout (p=0.1) (Gal and Ghahramani 2016b)	0.031 / 1.8%	0.30 / 4.5% (91.1)	2.68	0.88
Concrete Dropout (Gal, Hron, and Kendall 2017)	0.029 / 1.4%	0.28 / 3.4% (91.2)	2.64	0.90
MNF posteriors (Louizos and Welling 2017)	0.025 / 0.8%	0.26 / 2.5% (90.8)	2.55	0.92
Deep ensembles (5) (Lakshminarayanan, Pritzel, and Blundell 2017)	0.024 / 0.5%	0.22 / 1.8% (92.4)	2.50	0.94
SWAG (Maddox et al. 2019)	0.025 / 0.7%	0.23 / 2.1% (92.1)	2.53	0.93

Takeaway: deep ensembles and HMC set the calibration ceiling; among scalable single-model VI methods, Flipout and MNF posteriors are Pareto-dominant on the accuracy/NLL/ECE frontier, while MC Dropout is the cheapest-to-deploy option at the cost of visible mis-calibration.

Gaussian Processes and Deep Kernels

Gaussian processes occupy a distinctive position in the probabilistic deep-learning landscape. They are non-parametric: the model complexity grows with the dataset rather than being fixed by a parameter count. They provide exact Bayesian inference for Gaussian regression with a conjugate Gaussian prior over functions, avoiding the approximations that Bayesian neural networks force onto the posterior. And they deliver calibrated predictive variance almost by construction: posterior variance grows away from training points, shrinks near them, and reflects the interpolating geometry of the kernel. The canonical reference (Rasmussen and Williams 2006) remains the single most-cited GP textbook and is the foundation on which this part rests.

The main obstacle to GP deployment at deep-learning scales is the $\mathcal{O}(N^3)$ cost of inverting the $N \times N$ kernel matrix and the $\mathcal{O}(N^2)$ memory cost of storing it¹. Four scaling strategies have dominated the literature. Inducing-point variational methods, introduced by (Titsias 2009) and scaled to stochastic gradients by (Hensman, Fusi, and Lawrence 2013), replace the full dataset with a small set of $M \ll N$ pseudo-inputs whose values serve as sufficient statistics for the posterior, reducing cost to $\mathcal{O}(NM^2)$ per evaluation and $\mathcal{O}(M^3)$ for the inducing-point covariance; choosing $M$ and the inducing-point locations is itself a tuning problem². Structured kernel interpolation (the KISS-GP line of (Wilson and Nickisch 2015)) exploits Kronecker and Toeplitz structure in the kernel matrix when inputs lie on grids (or are interpolated to grids) to reduce the inversion to $\mathcal{O}(N)$ or $\mathcal{O}(N \log N)$ . Conjugate-gradient Lanczos methods (Gardner et al. 2018) avoid the Cholesky decomposition altogether by formulating inference as matrix-vector product iteration, which plays well with GPU acceleration. Random Fourier features (Rahimi and Recht 2007) take the orthogonal kernel-side route: by Bochner's theorem any shift-invariant kernel admits the integral representation $k(x-x') = \mathbb{E}_{\omega}[\cos(\omega^\top(x-x'))]$ for some spectral density, so averaging $D$ Monte-Carlo samples $z_i(x) = \sqrt{2/D}\cos(\omega_i^\top x + b_i)$ yields an explicit $D$ -dimensional feature map that reduces kernel ridge regression to a $D \times D$ primal system at $\mathcal{O}(N D^2)$ cost; the construction is still the standard scaffolding behind modern primal-form GP approximations.

Expressive kernel design has co-evolved with scalable inference. Deep GPs (Damianou and Lawrence 2013) stack GPs as compositional priors: the output of one GP feeds the input of the next, yielding a non-Gaussian prior over functions whose inductive bias is richer than any single stationary kernel; the practical depth ceiling is shallow because intermediate layers tend to collapse to near-deterministic mappings without identity initialization or skip-connections³, and beyond two or three layers the marginal-likelihood improvements saturate with mild overfit⁴. The original two-layer inference used variational expectation-maximization; (Salimbeni and Deisenroth 2017) made deep GPs practical via doubly stochastic sampling from the variational posterior at intermediate layers with stochastic gradients (doubly stochastic: over minibatches and over layer samples). Deep kernel learning (DKL) (Wilson et al. 2016a, 2016b) takes a complementary path: a neural feature extractor $h_\theta$ feeds a fixed-form GP kernel $k_{\text{base}}$ , yielding $k_{\text{DKL}}(x, x') = k_{\text{base}}(h_\theta(x), h_\theta(x'))$ trained jointly by maximizing the marginal likelihood. (Ober, Rasmussen, and Wilk 2021) documents the failure modes: the feature extractor maps training inputs to coincident latent points and predictive variance collapses⁵, and the marginal-likelihood signal that supposedly regularizes the joint optimization can itself overfit on hyperparameters with the deep parameterization⁶.

A complementary line attacks the same scaling problem from the opposite direction: rather than approximating exact GPs, amortise posterior inference itself with a neural function. Neural Processes (Garnelo et al. 2018) take a meta-learning view. At training time the model sees many regression tasks. For each task it receives a context set $\mathcal{C} = \{(x_i, y_i)\}_{i=1}^{n_C}$ and a target set $\mathcal{T} = \{(x^*, y^*)\}$ . A shared encoder $h$ maps each context pair to a representation $r_i = h(x_i, y_i)$ , which are aggregated into a permutation-invariant summary $r = \frac{1}{n_C}\sum_i r_i$ . A latent variable $z$ is sampled from a distribution $q(z \mid r)$ , and a decoder $g(x^*, z)$ outputs a predictive distribution over $y^*$ . The objective is an ELBO: the decoder's log-likelihood term rewards accurate predictions at target points; the KL term regularises $q(z \mid \mathcal{C})$ toward $q(z \mid \mathcal{C} \cup \mathcal{T})$ , which is the GP-analogue of the posterior contracting around the full data. At test time, inference is a single forward pass through the encoder and decoder: $\mathcal{O}(n_C)$ in the context size, not $\mathcal{O}(n_C^3)$ .

The Conditional Neural Process (CNP) variant drops $z$ entirely and conditions the decoder deterministically on the aggregated $r$ . This makes training simpler (no ELBO, direct log-likelihood) but removes the ability to sample diverse function realisations from the same context; predictive variance comes only from the decoder's output distribution, not from uncertainty over $z$ .

The Neural Process family inherits two GP properties (calibrated predictive variance and a function-space prior) and replaces the $\mathcal{O}(n_C^3)$ kernel inversion with $\mathcal{O}(n_C)$ amortised inference at the price of an offline meta-training stage. The practical cost is distribution shift: the encoder-decoder pair is trained on a task distribution, and if the test-time task lies outside that distribution (different noise scale, different input domain) the amortised approximation degrades. Neural Processes match GP baselines on synthetic 1D regression and on CelebA image completion (where predictions converge toward ground truth as the context grows from 1 to 100 observations), but underperform on real-world tabular benchmarks where the meta-training distribution does not cover the test distribution. The aggregation bottleneck (mean-pooling compresses all context into a fixed-size vector) prevents fine-grained interpolation between context points, producing smooth but sometimes over-smoothed predictions. Attentive Neural Processes (Kim et al. 2019) replace mean-pooling with cross-attention from target inputs to context pairs, recovering most of the GP-accuracy gap, and is the contemporary baseline for the family.

The final strand in this part is the NNGP-NTK bridge. (Neal 1996) first observed that a one-hidden-layer Bayesian neural network with a Gaussian weight prior becomes a GP in the infinite-width limit. (Lee et al. 2018) and (Matthews et al. 2018) extended this to arbitrarily deep fully connected networks with a recursive kernel formula whose two hyperparameters $(\sigma_w^2, \sigma_b^2)$ must be tuned to the edge of chaos to avoid degenerate fixed points⁷. (Novak et al. 2019) completed the picture for convolutional networks; (Yang 2019) established the Tensor Programs formalism that mechanically derives the NNGP kernel for any architecture expressible as a composition of standard operations; (Hron et al. 2020) applied the same machinery to multi-head attention. On the parallel optimization-theoretic track, (Jacot, Gabriel, and Hongler 2018) showed that infinite-width networks trained by gradient descent with squared loss behave like kernel regression with the Neural Tangent Kernel (NTK), a cousin of the NNGP kernel that captures the linearized training dynamics rather than the Bayesian posterior. The NNGP/NTK limits formalize what "Bayesian neural network" and "trained neural network" mean in the infinite-width regime, but they pay a price: infinite-width Bayesian inference does not learn features, so the CIFAR-10 gap to SGD-trained finite-width CNNs persists⁸. They connect the GP literature of this part to the statistical-learning theory developed in Statistical DL Part III.

GP / NNGP / NTK SoTA leaderboard. Architecture / kernel column gives the GP family (Full-GP, SVGP, KISS-GP, Deep-GP, DKL, NNGP, NTK). Loss / inference cell uses (key=value) for headline hyperparameters; ML-II = Type-II maximum likelihood, VI = variational inference, DSVI = doubly stochastic VI, ELBO = evidence lower bound. Data = headline benchmark. UCI NLL is averaged over the nine regression splits (boston, concrete, energy, kin8nm, naval, power, protein, wine, yacht); UCI RMSE is the protein split as representative. MNIST / CIFAR-10 = image-classification top-1 accuracy (%). Compute = asymptotic per-iteration cost class. Dashes mark unreported settings.
					UCI		MNIST	CIFAR-10
Method	Year	Architecture	Loss / inference	Data	NLL	RMSE	top-1	top-1	Compute
Full GP (Rasmussen and Williams 2006)	2006	Full-GP	ML-II + Cholesky	UCI (small)	2.97	4.34	-	-	$N^3$
VFE / Titsias (Titsias 2009)	2009	Sparse-GP	VFE-ELBO (M=200)	UCI	2.91	4.25	-	-	$NM^2$
Hensman SVGP (Hensman, Fusi, and Lawrence 2013)	2013	SVGP	VI ELBO (M=500, mb)	UCI big	2.89	4.20	-	-	$BM^2$
KISS-GP (Wilson and Nickisch 2015)	2015	KISS-GP	SKI + LCG (M=10K)	Airline	-	-	-	-	$N$
Deep GP 2-L (Damianou and Lawrence 2013)	2013	Deep-GP	VEM (L=2, M=100)	UCI protein	2.88	4.27	-	-	$NM^2 L$
Salimbeni DSVI Deep GP (3-L) (Salimbeni and Deisenroth 2017)	2017	Deep-GP	DSVI ELBO (L=3, M=128)	UCI protein	2.81	4.11	-	-	$BM^2 L$
DSVI Deep GP (5-L) (Salimbeni and Deisenroth 2017)	2017	Deep-GP	DSVI ELBO (L=5, M=128)	UCI protein	2.81	4.11	-	-	$BM^2 L$
Dutordoir DGP (skip-init) (Dutordoir, Durrande, and Hensman 2021)	2021	Deep-GP	DSVI + identity-init	UCI	2.79	4.08	-	-	$BM^2 L$
DKL (Wilson et al. 2016a)	2016	DKL	ML-II (DCNN + RBF)	CIFAR-10	-	-	-	93.0	$NM^2$
SV-DKL (Wilson et al. 2016b)	2016	DKL	SV-ELBO (DCNN + RBF)	CIFAR-10	-	-	-	93.2	$BM^2$
Salimbeni orth. DKL (Salimbeni et al. 2018)	2018	DKL	orth-decoupled VI	UCI	2.83	4.14	-	-	$BM^2$
Bayesian DKL (Ober, Rasmussen, and Wilk 2021)	2021	DKL	full Bayes (HMC + GP)	UCI / synth	-	-	-	-	$N^3$
Lee NNGP MLP (Lee et al. 2018)	2018	NNGP-MLP	arc-cos K (depth 2-20)	MNIST	-	-	98.8	56.2	kernel-only
Novak CNN-NNGP-GAP (Novak et al. 2019)	2019	NNGP-CNN	arc-cos K (8L + GAP)	CIFAR-10	-	-	-	77.2	kernel-only
Yang TP NNGP (Yang 2019)	2019	NNGP-TP	Tensor Programs K	Arbitrary	-	-	-	-	kernel-only
Jacot NTK MLP (Jacot, Gabriel, and Hongler 2018)	2018	NTK-MLP	NTK regression (closed)	MNIST	-	-	98.6	-	kernel-only
Novak Myrtle-CNN NTK (Novak et al. 2020)	2020	NTK-CNN	Myrtle K + ridge	CIFAR-10	-	-	-	81.4	kernel-only
Hron Transformer NNGP (Hron et al. 2020)	2020	NNGP-Trans	attention-NNGP K	LM small	-	-	-	-	kernel-only

The table groups the GP literature into five families, walked through in chronological order. Exact GPs (Full-GP, GPyTorch's Cholesky path) deliver the gold-standard posterior at $\mathcal{O}(N^3)$ cost⁹, leaving everything else as an approximation budget; the kernel-design choice is itself a non-trivial inductive-bias commitment that does not transfer between datasets¹⁰. Sparse GPs (VFE / Titsias, Hensman SVGP, KISS-GP) reduce that cost to $\mathcal{O}(NM^2)$ or $\mathcal{O}(N)$ by representing the posterior through $M$ pseudo-inputs or a structured-grid interpolation; natural-gradient updates (Salimbeni, Eleftheriadis, and Hensman 2018) and orthogonally-decoupled parameterizations (Salimbeni et al. 2018) further accelerate convergence. The $M$ -budget remains hand-tuned¹¹, and Hensman's SVGP is what made GPs Adam-trainable on GPU minibatches.

Deep GPs (Damianou-Lawrence, Salimbeni DSVI, Dutordoir skip-init) stack GPs as compositional priors. The 2-layer / 3-layer / 5-layer rows in the table tell a single story: the first composition step lifts the held-out NLL noticeably; the third composition step lifts it slightly; further depth saturates and risks mild overfit¹². Without identity-init or residual connections the intermediate latent collapses to a near-deterministic mapping¹³, which is why Dutordoir's skip-init recipe matters disproportionately to its incremental NLL gain. Self-distillation-style EM (the original VEM in Damianou-Lawrence, where each layer's posterior is treated as a teacher signal for the next E-step) was eventually superseded by the cleaner stochastic DSVI gradient.

Deep kernel learning (Wilson DKL, SV-DKL, Salimbeni orthogonal-decoupled, Bayesian DKL of Ober et al.) puts a neural feature extractor in front of a GP kernel; on CIFAR-10 the headline numbers (93.0 / 93.2%) sit right next to the same backbone trained as a plain DCNN (92.4%), so the DKL gain over the deterministic baseline is real but small. The trouble is calibration: the feature extractor compresses training inputs onto coincident latents and predictive variance collapses¹⁴, and the ML-II marginal likelihood used for joint hyperparameter selection can itself overfit with a deep parameterization¹⁵. Ober et al.'s Bayesian-DKL diagnoses the pathology and recommends putting a prior on the feature-extractor weights, which substantially closes the gap to a properly Bayesian alternative at the cost of HMC sampling.

GP family evolution from 2006 full-GP baselines to 2021 Bayesian-DKL diagnostics. Marker shape and colour encode family (exact, sparse, deep-GP, DKL, NNGP/NTK); the dotted line marks the full-GP UCI baseline at NLL 2.97 (lower is better; the y-axis is inverted). Three observations match the table reading: (i) the **sparse** line (VFE, SVGP) lands within 0.1 NLL of the full-GP gold-standard while reducing per-iteration cost from $\mathcal{O}(N^3)$ to $\mathcal{O}(NM^2)$ , which is what made GPs viable on UCI big at $N \approx 2 \times 10^6$ ; (ii) the **deep-GP** line drops a further 0.07-0.10 NLL by adding compositional depth, but saturates at 3 layers; (iii) the **NNGP/NTK** line forms a horizontal band near the full-GP baseline, demonstrating that infinite-width Bayesian inference recovers GP-level NLL but does not surpass it because no feature learning happens in the kernel-only regime.

The NNGP / NTK family forms a parallel branch: the same arc-cosine kernel that drops out of an infinite-width ReLU MLP also serves as a strong MNIST baseline (98.8% top-1 with depth 2-20), and Novak's CNN-GAP variant lifts CIFAR-10 to 77.2% with no SGD at all. Yang's Tensor-Programs framework mechanically derives the NNGP kernel for any standard architecture; Hron extends this to multi-head attention. The persistent ceiling is the feature-learning gap¹⁶: SGD-trained finite-width CNNs sit roughly 10 points above the NNGP-CNN-GAP number, which is the single most-cited reason to prefer practical neural networks over their infinite-width Bayesian limits despite the calibration story being on the GP side. Tuning $(\sigma_w^2, \sigma_b^2)$ onto the edge of chaos is required to avoid degenerate fixed points¹⁷, and the resulting kernel inherits the kernel-design problem in a different costume¹⁸.

A diagonal cut across the table separates the closed-form methods (Full-GP, NNGP, NTK; Gaussian likelihood or kernel-only) from the approximate-inference methods (SVGP, Deep-GP, DKL; non-Gaussian likelihoods, deep parameterizations, or both)¹⁹. Three readings. First, sparse methods close most of the cost gap to full GPs without sacrificing the calibrated-uncertainty story; the choice between Titsias-VFE, Hensman-SVGP, KISS-GP, and natural-gradient SVGP is largely a regime question (data shape, batch availability, inducing-point structure). Second, deep GPs help modestly and saturate quickly²⁰, so depth in the GP world is qualitatively different from depth in the deterministic-network world. Third, deep kernel learning reaches strong CIFAR-10 numbers but the original papers underreported variance-collapse pathologies that Ober et al.'s Bayesian-DKL diagnoses; the fix (full Bayes over feature-extractor weights) restores calibration at the cost of HMC sampling and effectively decommits the original DKL value proposition.

Uncertainty Quantification and Applications

The UQ literature for deep nets splits along a methodological axis (how uncertainty is produced) and an operational axis (what guarantee is offered). Bayesian methods (BNNs, deep ensembles, MC dropout, Laplace) produce a posterior predictive whose moments decompose cleanly into epistemic and aleatoric components (Kendall and Gal 2017), at the cost of extra compute and the usual approximation concerns covered in Part III. Post-hoc calibration methods (Platt scaling, isotonic regression, temperature scaling (Guo et al. 2017)) are free (single scalar added to softmax logits) but only correct the marginal confidence histogram without improving ranking. Conformal prediction (Vovk, Gammerman, and Shafer 2005; Angelopoulos and Bates 2021) sidesteps the probabilistic modeling question entirely and instead delivers a distribution-free finite-sample coverage guarantee at a target level $1 - \alpha$ ²¹; the trade-off is that individual set sizes can be large if the base model is poor. OOD detection is the sister literature that reuses these signals as scores for a binary "in-distribution vs. out-of-distribution" classifier (Hendrycks and Gimpel 2017; Liang, Li, and Srikant 2018; Liu et al. 2020). Probabilistic programming languages (Pyro, NumPyro, TFP, Stan) (Bingham et al. 2019; Phan, Pradhan, and Jankowiak 2019; Tran et al. 2018; Carpenter et al. 2017) are the infrastructure layer that glues posteriors, calibrations, and predictive distributions into a coherent model. Bayesian RL (Ghavamzadeh et al. 2015; Osband et al. 2016; Depeweg et al. 2018) applies the decomposition to a dynamics model and uses the epistemic component as an intrinsic reward driving Thompson sampling and information-directed exploration.

Table tab:uq-sota summarizes the methodological landscape.

SoTA landscape for uncertainty quantification on deep nets. ECE is expected calibration error (lower is better, 15-bin equal-width on CIFAR / ImageNet); NLL is negative log-likelihood on the held-out test set; AUROC is OOD-detection area under the ROC curve (CIFAR-10 in-dist vs. SVHN out-of-dist, higher is better); Coverage reports the empirical coverage of a prediction set at target $1 - \alpha = 0.9$ . Numbers are representative values from the cited papers; absolute figures depend on the backbone (most entries here assume ResNet-110 / ResNet-50 / DenseNet-BC unless noted), split, and binning choice. Blank cells mean the method does not report that metric.
Method	ECE (CIFAR-100)	NLL (CIFAR-100)	AUROC (C10 vs. SVHN)	Coverage @ 0.9
Softmax baseline (Hendrycks and Gimpel 2017)	16.5%	1.15	89.9	-
Temperature scaling (Guo et al. 2017)	2.1%	0.94	90.1	-
Histogram binning (Guo et al. 2017)	2.8%	1.02	-	-
Platt / vector scaling (Platt 1999)	4.5%	0.99	-	-
Verified calibration (KLM) (Kumar, Liang, and Ma 2019)	1.9%	0.92	-	-
Deep ensemble (5 models) (Kendall and Gal 2017)	1.5%	0.78	93.7	-
ODIN (T=1000, eps=0.0014) (Liang, Li, and Srikant 2018)	-	-	96.7	-
Energy score (Liu et al. 2020)	-	-	98.5	-
Likelihood ratios (genomic) (Ren et al. 2019)	-	-	89.0	-
Split conformal (APS) (Angelopoulos and Bates 2021)	-	-	-	0.901
CQR (conformalized QR) (Romano, Patterson, and Candès 2019)	-	-	-	0.901
Weighted conformal (cov-sh) (Tibshirani et al. 2019)	-	-	-	0.903

Three methodological observations follow from the table. First, post-hoc calibration dominates accuracy-matched cost: one extra scalar parameter (the temperature $T$ ) reduces ECE by roughly an order of magnitude on CIFAR-100 at zero accuracy cost²² (Guo et al. 2017), making temperature scaling the default first step for any deployed classifier. Second, OOD detection and calibration are separate problems: a perfectly calibrated model can still fail catastrophically on inputs from outside the training distribution (ImageNet-trained ResNet assigns confident softmax to texture noise), which is why the OOD literature designs a different score (energy, ODIN, likelihood-ratio) rather than reading off softmax confidence. Third, conformal prediction is orthogonal to both: it wraps any base model and converts its score into a set with exact marginal coverage, regardless of whether the score was calibrated or how the model was trained.

Historical Notes

The separation of reducible from irreducible uncertainty predates deep learning by centuries. Laplace's Essai philosophique sur les probabilités (1814) distinguishes between "ignorance of the causes" (epistemic) and "inherent variability" (aleatoric) in the context of classical mechanics. Knight's Risk, Uncertainty, and Profit (1921) made the same distinction in economics, calling the epistemic component true uncertainty and the aleatoric component risk. In machine learning, the distinction was re-introduced by (Neal 1996) in the context of Bayesian neural networks: the posterior variance over weights is epistemic, the likelihood variance (observation noise) is aleatoric. Modern deep-learning UQ is thus not a new paradigm but a practical engineering of ideas that have been known for generations.

Calibration as a desideratum goes back at least to Brier's (1950) meteorological scoring rule, which rewarded reliable probabilistic forecasts of rain. Platt (1999) (Platt 1999) was the first to apply calibration to machine-learning classifiers (SVMs). The "deep learning is overconfident" observation by (Guo et al. 2017) was partly a rediscovery of the same phenomenon that had been noted in boosted trees and had been partly understood as a general property of margin-maximizing classifiers.

Conformal prediction originated in (Vovk, Gammerman, and Shafer 2005), who developed the theory for online sequential prediction with exchangeability. The framework was largely unknown outside the learning-theory community until (Angelopoulos and Bates 2021) wrote an accessible tutorial for deep-learning practitioners, triggering rapid adoption. The tutorial paper became the most-cited introduction to conformal prediction within three years of publication, demonstrating that exposition can be as impactful as novel theory in moving a field.

OOD detection has its roots in the anomaly detection literature (Chandola et al. 2009 survey) but became a deep-learning-specific research program with (Hendrycks and Gimpel 2017), who framed it as a binary classification task with a specific evaluation protocol (CIFAR-10 vs. SVHN, TinyImageNet vs. LSUN). The AUROC/FPR-at-95-TPR metrics became the de facto OOD benchmark²³.

References

Angelopoulos, Anastasios N., and Stephen Bates. 2021. "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification." arXiv Preprint arXiv:2107.07511. https://arxiv.org/abs/2107.07511.

Attias, Hagai. 1999. "Inferring Parameters and Structure of Latent Variable Models by Variational Bayes." In UAI, 21-30.

Belanger, David, and Andrew McCallum. 2016. "Structured Prediction Energy Networks." In ICML, 983-92.

Bingham, Eli, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. 2019. "Pyro: Deep Universal Probabilistic Programming." Journal of Machine Learning Research 20 (28): 1-6.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.

Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. 2017. "Variational Inference: A Review for Statisticians." Journal of the American Statistical Association 112 (518): 859-77.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3: 993-1022.

Blundell, Charles, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. "Weight Uncertainty in Neural Networks." In ICML.

Carpenter, Bob, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A. Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. "Stan: A Probabilistic Programming Language." Journal of Statistical Software 76 (1): 1-32. https://doi.org/10.18637/jss.v076.i01.

Damianou, Andreas C., and Neil D. Lawrence. 2013. "Deep Gaussian Processes." In AISTATS, 207-15.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. "Maximum Likelihood from Incomplete Data via the EM Algorithm." Journal of the Royal Statistical Society, Series B 39 (1): 1-38.

Depeweg, Stefan, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. 2018. "Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-Sensitive Learning." In Proceedings of the 35th International Conference on Machine Learning, 1184-93. https://arxiv.org/abs/1710.07283.

Dusenberry, Michael W., Ghassen Jerfel, Yeming Wen, Yi-An Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan, and Dustin Tran. 2020. "Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors." In ICML.

Dutordoir, Vincent, Nicolas Durrande, and James Hensman. 2021. "Deep Neural Networks as Point Estimates for Deep Gaussian Processes." In NeurIPS.

Gal, Yarin, and Zoubin Ghahramani. 2016a. "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks." In NeurIPS.

- - - . 2016b. "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." In ICML.

Gal, Yarin, Jiri Hron, and Alex Kendall. 2017. "Concrete Dropout." In NeurIPS.

Gardner, Jacob R., Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew Gordon Wilson. 2018. "GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration." In NeurIPS.

Garnelo, Marta, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. 2018. "Neural Processes." In ICML 2018 Theoretical Foundations and Applications of Deep Generative Models Workshop. https://arxiv.org/abs/1807.01622.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third. Chapman; Hall/CRC.

Ghavamzadeh, Mohammad, Shie Mannor, Joelle Pineau, and Aviv Tamar. 2015. "Bayesian Reinforcement Learning: A Survey." Foundations and Trends in Machine Learning 8 (5-6): 359-483. https://doi.org/10.1561/2200000049.

Graves, Alex. 2011. "Practical Variational Inference for Neural Networks." In NeurIPS.

Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. "On Calibration of Modern Neural Networks." In Proceedings of the 34th International Conference on Machine Learning, 1321-30. https://arxiv.org/abs/1706.04599.

Hendrycks, Dan, and Kevin Gimpel. 2017. "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks." In International Conference on Learning Representations. https://arxiv.org/abs/1610.02136.

Hensman, James, Nicolo Fusi, and Neil D. Lawrence. 2013. "Gaussian Processes for Big Data." In UAI, 282-90.

Hinton, Geoffrey E., and Drew van Camp. 1993. "Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights." COLT.

Hinton, Geoffrey E., Peter Dayan, Brendan J. Frey, and Radford M. Neal. 1995. "The Wake-Sleep Algorithm for Unsupervised Neural Networks." Science 268 (5214): 1158-61.

Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. 2006. "A Fast Learning Algorithm for Deep Belief Nets." Neural Computation 18 (7): 1527-54.

Hron, Jiri, Yasaman Bahri, Jascha Sohl-Dickstein, and Roman Novak. 2020. "Infinite Attention: NNGP and NTK for Deep Attention Networks." In ICML.

Hyvärinen, Aapo. 2005. "Estimation of Non-Normalized Statistical Models by Score Matching." Journal of Machine Learning Research 6: 695-709.

Izmailov, Pavel, Sharad Vikram, Matthew D. Hoffman, and Andrew Gordon Wilson. 2021. "What Are Bayesian Neural Network Posteriors Really Like?" In ICML.

Jacot, Arthur, Franck Gabriel, and Clement Hongler. 2018. "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." In NeurIPS.

Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Edited by G. Larry Bretthorst. Cambridge University Press.

Jordan, Michael I., Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. "An Introduction to Variational Methods for Graphical Models." Machine Learning 37 (2): 183-233.

Kalman, R. E. 1960. "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering 82 (1): 35-45.

Kendall, Alex, and Yarin Gal. 2017. "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" In NeurIPS.

Kingma, Diederik P., Tim Salimans, and Max Welling. 2015. "Variational Dropout and the Local Reparameterization Trick." In NeurIPS.

Kingma, Diederik P., and Max Welling. 2014. "Auto-Encoding Variational Bayes." In ICLR.

Koller, Daphne, and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Kumar, Ananya, Percy Liang, and Tengyu Ma. 2019. "Verified Uncertainty Calibration." In Advances in Neural Information Processing Systems. Vol. 32. https://arxiv.org/abs/1909.10155.

Lafferty, John D., Andrew McCallum, and Fernando C. N. Pereira. 2001. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 282-89.

Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2017. "Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles." In NeurIPS.

Lee, Jaehoon, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. "Deep Neural Networks as Gaussian Processes." In ICLR.

Liang, Shiyu, Yixuan Li, and R. Srikant. 2018. "Enhancing the Reliability of Out-of-Distribution Image Detection in Neural Networks." In International Conference on Learning Representations. https://arxiv.org/abs/1706.02690.

Liu, Weitang, Xiaoyun Wang, John D. Owens, and Yixuan Li. 2020. "Energy-Based Out-of-Distribution Detection." In NeurIPS.

Louizos, Christos, and Max Welling. 2017. "Multiplicative Normalizing Flows for Variational Bayesian Neural Networks." In ICML.

MacKay, David J. C. 1992. "A Practical Bayesian Framework for Backpropagation Networks." Neural Computation 4 (3): 448-72.

- - - . 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

Maddox, Wesley J., Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. 2019. "A Simple Baseline for Bayesian Uncertainty in Deep Learning." In NeurIPS.

Matthews, Alexander G. de G., Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. 2018. "Gaussian Process Behaviour in Wide Deep Neural Networks." In ICLR.

Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press. https://probml.github.io/pml-book/book1.html.

Neal, Radford M. 1996. Bayesian Learning for Neural Networks. Springer.

Neal, Radford M., and Geoffrey E. Hinton. 1998. "A View of the EM Algorithm That Justifies Incremental, Sparse, and Other Variants." In Learning in Graphical Models, 355-68. Kluwer Academic Publishers.

Novak, Roman, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. 2020. "Neural Tangents: Fast and Easy Infinite Neural Networks in Python." In ICLR.

Novak, Roman, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2019. "Bayesian Deep Convolutional Networks with Many Channels Are Gaussian Processes." In ICLR.

Ober, Sebastian W., Carl Edward Rasmussen, and Mark van der Wilk. 2021. "The Promises and Pitfalls of Deep Kernel Learning." In UAI, 1206-16.

Osband, Ian. 2016. "Risk Versus Uncertainty in Deep Learning: Bayes, Bootstrap and the Dangers of Dropout." In NeurIPS Workshop on Bayesian Deep Learning.

Osband, Ian, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. 2016. "Deep Exploration via Bootstrapped DQN." In NeurIPS.

Phan, Du, Neeraj Pradhan, and Martin Jankowiak. 2019. "Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro." arXiv Preprint arXiv:1912.11554. https://arxiv.org/abs/1912.11554.

Platt, John. 1999. "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods." In Advances in Large Margin Classifiers, edited by A. Smola, P. Bartlett, B. Scholkopf, and D. Schuurmans, 61-74. MIT Press.

Rabiner, Lawrence R. 1989. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE 77 (2): 257-86.

Rahimi, Ali, and Benjamin Recht. 2007. "Random Features for Large-Scale Kernel Machines." In Advances in Neural Information Processing Systems (NeurIPS). Vol. 20. https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press.

Ren, Jie, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark A. DePristo, Joshua V. Dillon, and Balaji Lakshminarayanan. 2019. "Likelihood Ratios for Out-of-Distribution Detection." In Advances in Neural Information Processing Systems. Vol. 32. https://arxiv.org/abs/1906.02845.

Rezende, Danilo Jimenez, and Shakir Mohamed. 2015. "Variational Inference with Normalizing Flows." In ICML.

Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014. "Stochastic Backpropagation and Approximate Inference in Deep Generative Models." In ICML.

Romano, Yaniv, Evan Patterson, and Emmanuel J. Candès. 2019. "Conformalized Quantile Regression." In Advances in Neural Information Processing Systems. Vol. 32. https://arxiv.org/abs/1905.03222.

Salakhutdinov, Ruslan, and Geoffrey E. Hinton. 2009. "Deep Boltzmann Machines." In AISTATS, 448-55.

Salimbeni, Hugh, Ching-An Cheng, Byron Boots, and Marc Deisenroth. 2018. "Orthogonally Decoupled Variational Gaussian Processes." In NeurIPS.

Salimbeni, Hugh, and Marc Peter Deisenroth. 2017. "Doubly Stochastic Variational Inference for Deep Gaussian Processes." In NeurIPS, 4588-99.

Salimbeni, Hugh, Stefanos Eleftheriadis, and James Hensman. 2018. "Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models." In AISTATS.

Tibshirani, Ryan J., Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas. 2019. "Conformal Prediction Under Covariate Shift." In Advances in Neural Information Processing Systems. Vol. 32. https://arxiv.org/abs/1904.06019.

Titsias, Michalis K. 2009. "Variational Learning of Inducing Variables in Sparse Gaussian Processes." In AISTATS, 567-74.

Tran, Dustin, Matthew D. Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew J. Johnson, and Rif A. Saurous. 2018. "Simple, Distributed, and Accelerated Probabilistic Programming." Advances in Neural Information Processing Systems 31. https://arxiv.org/abs/1811.02091.

Vovk, Vladimir, Alexander Gammerman, and Glenn Shafer. 2005. Algorithmic Learning in a Random World. First. Springer.

Wen, Yeming, Paul Vicol, Jimmy Ba, Dustin Tran, and Roger Grosse. 2018. "Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches." In ICLR.

Wilson, Andrew Gordon, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. 2016a. "Deep Kernel Learning." In AISTATS, 370-78.

- - - . 2016b. "Stochastic Variational Deep Kernel Learning." In NeurIPS, 2586-94.

Wilson, Andrew Gordon, and Pavel Izmailov. 2020. "Bayesian Deep Learning and a Probabilistic Perspective of Generalization." NeurIPS.

Wilson, Andrew Gordon, and Hannes Nickisch. 2015. "Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP)." In ICML, 1775-84.

Yang, Greg. 2019. "Tensor Programs i: Wide Feedforward or Recurrent Neural Networks of Any Architecture Are Gaussian Processes." In NeurIPS.

Cubic exact-GP cost: the $N \times N$ kernel-matrix inversion is $\mathcal{O}(N^3)$ FLOPs, with $\mathcal{O}(N^2)$ memory for storage; this caps full-GP training at roughly $N \approx 5{,}000$ on commodity hardware and is the single fact that motivates the entire sparse / structured-kernel / matrix-vector-iteration literature.↩︎
Inducing-point budget: too few $M$ loses accuracy near the boundaries; too many $M$ defeats the $\mathcal{O}(NM^2)$ saving over full-GP $\mathcal{O}(N^3)$ ; placement (uniform grid, k-means on inputs, or learned by gradient on the ELBO) and $M$ itself remain hand-tuned per benchmark (Titsias 2009; Hensman, Fusi, and Lawrence 2013).↩︎
Deep-GP pinch-point: the intermediate latent of a stacked deep GP collapses to near-deterministic without identity initialization or residual / skip-connections, defeating the compositional-prior motivation that justified depth in the first place (Salimbeni and Deisenroth 2017; Dutordoir, Durrande, and Hensman 2021).↩︎
Deep-GP depth saturation: 2 layers help modestly, 3 layers help slightly more, and beyond 3 the held-out NLL on UCI regression saturates and shows mild overfit; doubling the layer count past 3 burns compute without proportional gain (Salimbeni and Deisenroth 2017).↩︎
DKL variance collapse: the feature extractor maps training inputs to coincident latent points so that $k_{\text{base}}(h_\theta(x_i), h_\theta(x_j)) \to k_{\text{base}}(0)$ for nearby $x_i, x_j$ , destroying the calibrated uncertainty that motivated using a GP read-out in the first place (Ober, Rasmussen, and Wilk 2021).↩︎
Marginal-likelihood overfitting: ML-II hyperparameter tuning treats the marginal likelihood as a model-selection score, but with the deep / DKL parameterization the kernel hyperparameters and feature-extractor weights have enough degrees of freedom to overfit the held-out fold's marginal likelihood, especially in low-data regimes (Ober, Rasmussen, and Wilk 2021).↩︎
Edge-of-chaos initialisation: NNGP/NTK kernels are useless in the ordered regime ( $\rho \to 1$ , all inputs map to the same fixed point) or the chaotic regime ( $\rho \to \rho_*$ , correlations decorrelate exponentially with depth); only the critical-line $(\sigma_w^2, \sigma_b^2)$ tuning yields informative deep kernels (Lee et al. 2018).↩︎
NNGP feature-learning gap: infinite-width Bayesian inference is kernel regression with a fixed kernel, so it does not learn features the way SGD-trained finite-width networks do; on CIFAR-10 the NNGP-CNN-GAP ceiling sits below SGD-trained CNNs by roughly 5-10 points, and Myrtle-architecture NNGPs only narrow the gap with hand-engineered architecture-specific kernels (Novak et al. 2019).↩︎
Cubic exact-GP cost: the $N \times N$ kernel-matrix inversion is $\mathcal{O}(N^3)$ FLOPs, with $\mathcal{O}(N^2)$ memory for storage; this caps full-GP training at roughly $N \approx 5{,}000$ on commodity hardware and is the single fact that motivates the entire sparse / structured-kernel / matrix-vector-iteration literature.↩︎
Kernel design choice: SE, Matern-3/2, Matern-5/2, rational-quadratic, periodic, and spectral-mixture kernels each impose different inductive biases (smoothness, bandwidth, periodicity, spectral support); these choices are not transferable across datasets and constitute a hand-engineered prior whose absence in deep-learning practice is part of why DKL is attractive (Rasmussen and Williams 2006).↩︎
Inducing-point budget: too few $M$ loses accuracy near the boundaries; too many $M$ defeats the $\mathcal{O}(NM^2)$ saving over full-GP $\mathcal{O}(N^3)$ ; placement (uniform grid, k-means on inputs, or learned by gradient on the ELBO) and $M$ itself remain hand-tuned per benchmark (Titsias 2009; Hensman, Fusi, and Lawrence 2013).↩︎
Deep-GP depth saturation: 2 layers help modestly, 3 layers help slightly more, and beyond 3 the held-out NLL on UCI regression saturates and shows mild overfit; doubling the layer count past 3 burns compute without proportional gain (Salimbeni and Deisenroth 2017).↩︎
Deep-GP pinch-point: the intermediate latent of a stacked deep GP collapses to near-deterministic without identity initialization or residual / skip-connections, defeating the compositional-prior motivation that justified depth in the first place (Salimbeni and Deisenroth 2017; Dutordoir, Durrande, and Hensman 2021).↩︎
DKL variance collapse: the feature extractor maps training inputs to coincident latent points so that $k_{\text{base}}(h_\theta(x_i), h_\theta(x_j)) \to k_{\text{base}}(0)$ for nearby $x_i, x_j$ , destroying the calibrated uncertainty that motivated using a GP read-out in the first place (Ober, Rasmussen, and Wilk 2021).↩︎
Marginal-likelihood overfitting: ML-II hyperparameter tuning treats the marginal likelihood as a model-selection score, but with the deep / DKL parameterization the kernel hyperparameters and feature-extractor weights have enough degrees of freedom to overfit the held-out fold's marginal likelihood, especially in low-data regimes (Ober, Rasmussen, and Wilk 2021).↩︎
NNGP feature-learning gap: infinite-width Bayesian inference is kernel regression with a fixed kernel, so it does not learn features the way SGD-trained finite-width networks do; on CIFAR-10 the NNGP-CNN-GAP ceiling sits below SGD-trained CNNs by roughly 5-10 points, and Myrtle-architecture NNGPs only narrow the gap with hand-engineered architecture-specific kernels (Novak et al. 2019).↩︎
Edge-of-chaos initialisation: NNGP/NTK kernels are useless in the ordered regime ( $\rho \to 1$ , all inputs map to the same fixed point) or the chaotic regime ( $\rho \to \rho_*$ , correlations decorrelate exponentially with depth); only the critical-line $(\sigma_w^2, \sigma_b^2)$ tuning yields informative deep kernels (Lee et al. 2018).↩︎
Kernel design choice: SE, Matern-3/2, Matern-5/2, rational-quadratic, periodic, and spectral-mixture kernels each impose different inductive biases (smoothness, bandwidth, periodicity, spectral support); these choices are not transferable across datasets and constitute a hand-engineered prior whose absence in deep-learning practice is part of why DKL is attractive (Rasmussen and Williams 2006).↩︎
Non-Gaussian-likelihood approximation: classification, count regression, and robust regression break the conjugate Gaussian-prior + Gaussian-likelihood closed-form posterior, forcing a choice between Laplace (mode-fitting bias), expectation propagation (heuristic stability), and variational inference (mean-seeking bias) with no clean theoretical winner (Rasmussen and Williams 2006).↩︎
Deep-GP depth saturation: 2 layers help modestly, 3 layers help slightly more, and beyond 3 the held-out NLL on UCI regression saturates and shows mild overfit; doubling the layer count past 3 burns compute without proportional gain (Salimbeni and Deisenroth 2017).↩︎
Conformal exchangeability requirement: split conformal's marginal-coverage guarantee assumes calibration and test points are exchangeable (e.g., i.i.d. from the same distribution); the guarantee fails outright under covariate shift, label shift, or temporal drift. Weighted conformal (Tibshirani et al. 2019) restores coverage when the shift's likelihood ratio is known, but the ratio itself is rarely available in practice.↩︎
ECE bin-choice sensitivity: the standard 15-bin equal-width estimator can differ from a 15-bin equal-mass estimator by 0.5-1.0 percentage points on CIFAR-100 ResNets (Kumar, Liang, and Ma 2019), so reported ECE numbers are only comparable across papers that fix the same binning scheme. Adaptive binning (one bin per quantile, plus debiased estimators) reduces variance but is not yet the field default.↩︎
Near-OOD vs far-OOD: the CIFAR-10 vs SVHN benchmark is far-OOD (different image domain, different label space) and is empirically easy; near-OOD benchmarks like CIFAR-10 vs CIFAR-100 (same image domain, disjoint label space) are 10-20 AUROC points harder for every detector and are arguably more diagnostic of deployment-time behavior. Reporting only far-OOD numbers (the historical default) overstates progress.↩︎