← Back to home

Statistical Deep Learning

PAC-Bayes, NTK, benign overfitting / double descent, and scaling laws

Classical Statistical Learning Foundations

This chapter catalogs the references covered in Part I, grouping them by topic and cross-linking to the chapter where each is discussed in detail. The goal is to make the prerequisite structure explicit: a reader who has seen, for instance, the Hoeffding inequality in passing can skip the concentration chapter; a reader who has not should follow the link before tackling VC/Rademacher bounds.

The unifying thread of Part I is the uniform-convergence paradigm: relate the population risk of the ERM hypothesis to its empirical risk via a capacity term that depends on the hypothesis class, then choose a class whose capacity is provably small. Every paper below contributes either a capacity measure (VC dimension, Rademacher complexity, fat-shattering dimension, margin), a tail inequality that turns expectations into high-probability statements (Hoeffding, Bernstein, McDiarmid), or a concrete algorithm whose generalization can be analyzed inside the framework (SVM, AdaBoost, kernel ridge regression). The later Parts of the book will show how each of these components either breaks or bends when the hypothesis class becomes a deep neural network trained by SGD.

Lineage of the classical theory

The statistical reading of supervised learning begins with the uniform convergence theorem of (Vapnik and Chervonenkis 1971), which bounds the gap between empirical and population error uniformly over a hypothesis class and, in the same stroke, introduces the growth function and the VC dimension as the first general capacity measures; this is the result on which the VC / Rademacher chapter is built. (Vapnik 1995) later organizes these bounds for both classification and regression into the Structural Risk Minimization programme, the prescription to trade empirical fit against capacity by minimizing over a nested family of classes; that programme frames both the ERM and MLE and VC / Rademacher chapters. The modern textbook synthesis of this uniform-convergence view (Rademacher complexity, VC, margin bounds, kernels) is (Mohri, Rostamizadeh, and Talwalkar 2018), used as the standing reference throughout ERM, VC, kernels, SVM, and boosting.

The capacity side of the theory was then sharpened in two directions. (Bartlett and Mendelson 2002) replaces the distribution-free VC dimension with the data-dependent Rademacher and Gaussian complexities and proves the symmetrization bound that is now standard, yielding estimates that adapt to the sample rather than to the worst case. In parallel, the capacity of neural networks specifically was estimated first by (Baum and Haussler 1989), whose VC-based sample-complexity count for threshold feed-forward networks is the earliest such result, and then collected with the pseudo-dimension and fat-shattering refinements in the monograph of (Anthony and Bartlett 1999); all three feed the VC / Rademacher chapter.

Underneath every such bound sits a tail inequality that converts an expectation into a high-probability statement. (Boucheron, Lugosi, and Massart 2013) is the standard reference for this machinery (Hoeffding, Bernstein, the bounded-differences inequality, the entropy method, and log-Sobolev arguments), and (Murphy 2023) gives the practitioner-oriented digest of the same tools as they feed PAC and PAC-Bayes analyses; both are treated in the Concentration chapter.

A complementary, estimation-theoretic account of the same fitting procedures comes from classical asymptotics. (Vaart 1998) develops the consistency, the n\sqrt{n} rate, and the limiting normal law of M-estimators and the maximum-likelihood estimator, while (Murphy 2022 Ch 5) supplies the Bayesian-probabilistic framing in which ERM, MLE, and MAP differ only by their regularizer; both inform the ERM and MLE chapter.

The kernel branch of the field gives the theory a function-space home. (Schölkopf and Smola 2002) is the canonical account of positive-definite kernels, the reproducing-kernel Hilbert space, the representer theorem, and their use in the SVM and kernel PCA, and (Rasmussen and Williams 2006) is its Bayesian counterpart, developing Gaussian-process covariance functions, posterior inference, the marginal likelihood, and the equivalence to kernel ridge regression; the Kernels chapter draws on both.

These pieces meet in the support vector machine, the algorithm that made the margin theory concrete. (Boser, Guyon, and Vapnik 1992) introduces the kernel trick in the dual of the margin classifier, allowing nonlinear boundaries with no explicit feature map; (Cortes and Vapnik 1995) adds the slack variables and the hinge loss of the soft-margin formulation, extending the method to non-separable data; and (Platt 1999) makes it practical with Sequential Minimal Optimization, a pairwise coordinate-ascent scheme that solves the quadratic program without a general QP solver. The SVM chapter follows exactly this progression.

Boosting supplies the other classical algorithm and, with it, the first clean case of benign overfitting. (Freund and Schapire 1997) introduces AdaBoost and bounds its training error by the product of the per-round normalization constants, and (Schapire et al. 1998) then explains the algorithm's most puzzling behavior, the continued improvement of test error after the training error reaches zero, with a bound that depends on the margin distribution rather than on the number of rounds; the Boosting chapter develops both. That last observation, capacity controlled through a margin rather than through a parameter count, is precisely the thread the later Parts pick up once the classifier becomes an overparameterized network.

SoTA table: classical benchmarks

Classical learning theory crystallized around a handful of algorithm families on a stable benchmark suite (MNIST digits, USPS, UCI Letter, UCI Pima/Spambase). Each family carries a signature failure mode that the uniform-convergence framework either explains or fails to explain, and these failures are exactly the entry points for the post-classical theory developed in Parts II-V. The leaderboard below catalogs the headline result for each family with hyperparameters in parentheses; the prose around it walks through the families chronologically and attaches footnote markers at the clause where each failure mode first becomes visible.

The intro paragraph for Part I already framed uniform convergence as the central paradigm. Three structural problems sit immediately under that paradigm and recur across every algorithm in the leaderboard. VC and Rademacher capacity bounds, applied directly to over-parameterized networks, produce sample-complexity estimates that exceed the dataset size by many orders of magnitude1; the data-dependent Rademacher refinement (Bartlett and Mendelson 2002) tightens this but still cannot certify the gap on modern nets2. AdaBoost's empirical resilience to overfitting after training error reaches zero forced a reformulation of the bound in terms of the margin distribution rather than the round count3 (Schapire et al. 1998); the reformulation transferred to kernel-SVM analysis because both are large-margin classifiers in a (possibly infinite-dimensional) feature space (Cortes and Vapnik 1995). The choice of kernel itself is an inductive bias (Schölkopf and Smola 2002 Ch 2): each kernel encodes a hypothesis class with a specific decay rate of Mercer eigenvalues, and there is no principled data-driven selection criterion4. ERM's i.i.d. assumption (Vapnik 1995) also fails outside the benchmark suite (time-series, distribution shift, covariate-shift settings)5, and the concentration tools that turn expectations into high-probability statements (Hoeffding, Bernstein, McDiarmid) tighten only when variance is bounded; heavy-tailed loss surfaces break the standard guarantee6. Two further structural points: classical theory analyzes capacity in isolation from the training algorithm7, yet SGD's optimization geometry contributes its own implicit regularization (Part IV); and SMO (Platt 1999), the workhorse SVM solver, scales as O(N2)O(N^2) in the Gram matrix and becomes impractical past N105N \approx 10^58, a dataset size that modern deep-learning pipelines exceed by 3-5 orders of magnitude.

Classical SoTA leaderboard. Hypothesis class summarizes the algorithm's RKHS or model family. Loss / Algo cell uses (key=value) for headline hyperparameters. Data column lists the training set size where reported. MNIST / USPS / UCI Letter columns report test-error percent under the experimental protocol of the cited paper; protocols differ across papers, so cross-column differences within a couple of percent are not meaningful. Notes column flags scale or methodology caveats. Dashes mark unreported settings.
MNIST USPS UCI Lett
Method Year Hypothesis class Loss / Algo Data err % err % err % Notes
LeNet-1 (Cortes and Vapnik 1995) 1995 CNN-feedforward SGD on softmax CE MNIST 60k 1.70 ~5.0 - baseline
soft-margin SVM (Cortes and Vapnik 1995) 1995 kernel-RKHS hinge SVM(C=10, poly d=4) MNIST 60k 1.10 4.00 - dual QP
hard-margin SVM (Boser, Guyon, and Vapnik 1992) 1992 kernel-RKHS hinge SVM(C=C{=}\infty, poly d=2) OCR 7.3k - - - sep-only
Gaussian SVM (Schölkopf and Smola 2002) 2002 kernel-RKHS hinge SVM(RBF, γ\gamma=2, C=10) MNIST 60k 1.40 4.20 - bw tuned
polynomial SVM d=9 (Schölkopf and Smola 2002) 2002 kernel-RKHS hinge SVM(poly d=9, C=10) MNIST 60k 1.10 - - best-of
logistic regression (Hastie, Tibshirani, and Friedman 2009 Ch 4) 2009 linear-margin logit-CE(2\ell_2, λ\lambda=1e-3) MNIST 60k 7.60 9.50 - linear
AdaBoost stumps (Freund and Schapire 1997) 1997 boosted-trees AdaBoost(T=100, depth-1) UCI Letter - - 6.60 weak
AdaBoost C4.5 (Freund and Schapire 1997) 1997 boosted-trees AdaBoost(T=100, C4.5 base) UCI Letter - - 3.10 strong
AdaBoost trees T=1000 (Schapire et al. 1998) 1998 boosted-trees AdaBoost(T=1000, depth-1) UCI Letter - - 3.10 margin
1-NN baseline (Hastie, Tibshirani, and Friedman 2009 Ch 13) 2009 non-parametric 1-NN Euclidean MNIST 60k 3.09 5.20 - k=1
k-NN k=3 (Hastie, Tibshirani, and Friedman 2009 Ch 13) 2009 non-parametric k-NN(k=3, Euclidean) MNIST 60k 2.83 4.95 - k=3
2-hidden-layer MLP (Cortes and Vapnik 1995) 1995 MLP SGD on softmax CE MNIST 60k 3.05 - - 300+100

The table groups the classical literature into five families (linear-margin, kernel-RKHS, boosted-trees, CNN-feedforward, non-parametric), walked through in chronological order. The margin cluster begins with Boser-Guyon-Vapnik's hard-margin formulation (Boser, Guyon, and Vapnik 1992) which only applies to linearly separable data; Cortes and Vapnik (Cortes and Vapnik 1995) introduced soft margins with slack variables and hinge loss, making SVMs applicable to the bulk of real benchmarks. The polynomial-kernel SVM at d=4d=4 reaches 1.1% MNIST test error, narrowly beating the contemporary LeNet-1 convnet baseline at 1.7%; this result was the headline selling point of kernel methods in the late 1990s. Gaussian-kernel SVMs (Schölkopf and Smola 2002) achieve a comparable 1.4% on MNIST with cross-validated bandwidth, but the choice between polynomial and RBF reflects a kernel-design judgement rather than a principled criterion9. SMO (Platt 1999) made these results computationally tractable by avoiding a general QP solver, yet the underlying O(N2)O(N^2) Gram-matrix scaling10 put a hard ceiling on dataset size that has only been broken by random features and sketching methods (Part V).

The boosting cluster begins with AdaBoost (Freund and Schapire 1997), whose training-error bound (the product of normalization constants) decays exponentially in the number of rounds. The empirical surprise was that test error on UCI Letter kept improving from round 5 (training error already at 0) through round 1000, an apparent violation of any naive capacity-based bound11. Schapire et al. (Schapire et al. 1998) resolved the puzzle by bounding test error in terms of the margin distribution rather than the round count: the cumulative fraction of training points with margin below a threshold θ\theta provides a class-capacity-times-Lipschitz bound that improves as the boosting iterates concentrate the margin mass to the right of θ\theta. The mechanism is structurally similar to the SVM-margin analysis but operates over a randomized hypothesis class rather than a single max-margin classifier.

The table separates four eras: kernel-margin (1992-2002), boosting (1997-1998), the early-CNN baseline (1995, included for honest comparison), and the non-parametric reference points (1-NN, k-NN). Three readings frame the rest of the chapter. First, on MNIST the kernel-SVM family and the early-CNN family converge to within roughly half a percentage point in the 1995-2002 window, which made the choice between them appear as a matter of inductive-bias preference rather than capacity. Second, AdaBoost's behaviour on UCI Letter is a clean falsification of any pure round-count or pure capacity-count bound: the margin distribution explanation (Schapire et al. 1998) is the historical bridge that connects boosting to the SVM-margin analysis and ultimately to the implicit-regularization story of Part IV. Third, every entry above relies on an i.i.d. sampling assumption13 and a capacity notion that ignores the optimizer14; both assumptions break in modern deep learning, which is the entry point for Parts II (PAC-Bayes) and III (NTK / mean-field).

PAC-Bayes and Modern Generalization Theory

Modern generalization theory begins with a puzzle: deep networks fit random labels with zero training error yet still generalize when the labels are real (Zhang et al. 2017). By any classical capacity measure (VC dimension, parameter count, Rademacher complexity on a norm ball wide enough to memorize), the hypothesis class is too rich for a non-trivial test-error bound; the random-label baseline is therefore not a bound but a sharp falsification of capacity-based explanations15. Four resolutions emerged, each with its own well-known structural problems walked through paragraph-by-paragraph after the SoTA leaderboard below: PAC-Bayes, spectrally-normalized margin bounds, information-theoretic mutual-information bounds, and sharpness-based empirical predictors. A fifth strand is the impossibility result of (Nagarajan and Kolter 2019), which proves a formal obstruction to the entire uniform-convergence program.

The PAC-Bayes resolution (section PAC-Bayes Framework) replaces worst-case analysis over a hypothesis class with a weighted average over an algorithm-dependent posterior QQ; the bound depends only on the KL divergence KL(QP)\mathrm{KL}(Q \Vert P) between QQ and a data-independent prior PP. Introduced by (McAllester 1999), refined by (Seeger 2002) for Gaussian-process classifiers via a binary-kl form, and systematized by (Catoni 2007) into a thermodynamic formalism, the framework was first made non-vacuous on a realistic deep network by (Dziugaite and Roy 2017) using a SGD-optimized Gaussian posterior on a 3-layer MLP16. (Neyshabur, Bhojanapalli, and Srebro 2018) connected PAC-Bayes with spectral norms via a perturbation argument; the resulting bound remains vacuous on production-scale ConvNets because the KL/n\sqrt{\mathrm{KL}/n} term blows up under the huge KL between random initialization and the trained weights17, and tightening the KL via a posterior shift toward zero conflicts with low train risk18. A second-generation refinement uses data-dependent priors (priors learned on a held-out split) (Dziugaite and Roy 2018)(Perez-Ortiz et al. 2021), at the cost that any data dependence in PP inflates the bound or breaks the proof19.

The spectrally-normalized margin resolution (section Spectrally-Normalized Margin Bounds) extends (Schapire et al. 1998)'s margin theory for boosting to neural networks: (Neyshabur, Tomioka, and Srebro 2015) gave the first product-of-Frobenius bound, (Bartlett, Foster, and Telgarsky 2017) sharpened it to the product of spectral norms divided by the classification margin. The bound is conceptually clean but suffers from spectral-norm vacuity: W\prod_\ell \|W_\ell\| grows polynomially in depth and the bound is vacuous beyond a small architecture20. The information-theoretic resolution (section Information-Theoretic Generalization Bounds) was initiated by (Russo and Zou 2016) in the adaptive-data-analysis literature and developed into a general tool by (Xu and Raginsky 2017), who replaced capacity with the mutual information I(S;W)I(S; W) between the training sample and the learned weights. (Bu, Zou, and Veeravalli 2020) sharpened the bound using per-sample MI I(Zi;W)/nI(Z_i; W)/n, which is tighter but harder to estimate21; (Haghifam et al. 2020) gave a conditional MI (CMI) refinement22, and (Pensia, Jog, and Loh 2018) applied the framework to Langevin dynamics where the MI admits a closed form.

A complementary line, orthogonal to capacity in the PAC sense, asks how parameter count itself relates to robust interpolation. (Bubeck and Sellke 2021) prove the Universal Law of Robustness via an isoperimetric argument: any class of pp-parameter Lipschitz models that interpolates nn labelled points on a distribution satisfying isoperimetry (Gaussians, uniform distributions on spheres, manifolds with positive Ricci curvature all qualify) must satisfy pndp \gtrsim n d up to logarithmic factors, where dd is the ambient data dimension. Put differently, the 𝒪(n)\mathcal{O}(n) parameters sufficient for non-robust interpolation are provably insufficient for smooth (i.e., Lipschitz-bounded) interpolation by a multiplicative factor of dd. The bound is tight: radial-basis-function constructions with p=n(d+1)p = n(d+1) parameters achieve 𝒪(1)\mathcal{O}(1)-Lipschitz interpolation, confirming that overparameterization by exactly dd is both necessary and sufficient.

The isoperimetry assumption controls how Lipschitz functions concentrate on the data distribution: for an LL-Lipschitz ff, it requires Pr[|f(x)𝔼[f]|t]2exp(dt2/2cL2)\Pr[|f(x) - \mathbb{E}[f]| \geq t] \leq 2\exp(-dt^2 / 2cL^2). The condition is the reason the bound's factor of dd is ambient dimension and not intrinsic manifold dimension (a distinction that matters in practice, since ImageNet data lives in 150528\mathbb{R}^{150528}, not in a low-dimensional manifold for the purposes of adversarial perturbations measured in 2\ell_2).

The lower bound is algorithm-independent: it applies to any procedure returning a Lipschitz interpolator, so neither implicit bias nor training dynamics can rescue a model from the pndp \gtrsim nd constraint. The bound gives a quantitative explanation of the (Szegedy et al. 2014) fragility observation (imperceptible adversarial perturbations fool AlexNet, with average 2\ell_2 distortion as small as 0.0065 on ImageNet): under-parameterized-for-robustness networks cannot achieve small Lipschitz constants even when they generalize cleanly on unperturbed inputs. Bubeck and Sellke estimate that robust ImageNet models may require 101010^{10} - 101110^{11} parameters, substantially more than current architectures, which is consistent with the persistent failure of adversarial training to close the accuracy-robustness gap at scale.

The sharpness-based empirical audit (section Empirical Audit of Bound Tightness) is the negative result: (Jiang et al. 2020) evaluated more than forty complexity measures across ten thousand trained networks. The most predictive measures (by Kendall-tau rank correlation with the true test gap) are sharpness / PAC-Bayes variants, not the norm-products that dominate the theory; the spectral-norm product is negatively correlated with the gap across realistic hyperparameter sweeps2324. The impossibility result of (Nagarajan and Kolter 2019) sharpens this picture: no two-sided uniform-convergence bound can be non-vacuous on certain linear classifiers under realistic SGD25; (Negrea, Dziugaite, and Roy 2020) partially defends uniform convergence by restricting attention to a hypothesis-class subset.

PAC-Bayes / generalization-theory SoTA leaderboard. Method links to the chapter section and cites the canonical paper. Bound class = the family (PAC-Bayes-classical, PAC-Bayes-spectral, PAC-Bayes-data-dep, Margin-spectral, Norm-product, Information-MI, Information-per-sample, Information-CMI, Sharpness-empirical, Impossibility, Random-label-baseline). Form = the schematic recipe or formula. Benchmark = the model / dataset reported in the original paper. Bound value = the headline number on a [0,1] loss (vacuous = greater than 1; analytic = no fielded number, only a closed form). Kendall τ\tau is from (Jiang et al. 2020) across \sim10,000 CNNs on CIFAR-10 / SVHN where available; - means the paper reports no rank-correlation experiment. Comment summarizes the principal limitation. Numbers are paper-reported headline values; settings differ across rows and only same-row reads are consistent.
Quality Scope
Paper Year Bound class Form Benchmark Bound value Kendall τ\tau Comment
McAllester (McAllester 1999) 1999 PAC-Bayes-classical (KL+log(2n/δ))/(2n)\sqrt{(\mathrm{KL}+\log(2\sqrt{n}/\delta))/(2n)} toy classifiers framework - foundational; not fielded on a deep net
Seeger (Seeger 2002) 2002 PAC-Bayes-classical binary-kl inversion GP / Pima Indians (n=768) 0.29 - tight near zero empirical risk; GP-only setup
Catoni (Catoni 2007) 2007 PAC-Bayes-classical Legendre / Gibbs Qβ*Q_\beta^* toy / theoretical analytic - unifies square-root and binary-kl via β\beta-tuning
Lacasse C-bound (Lacasse et al. 2007) 2007 PAC-Bayes-classical majority-vote derandomization toy / theoretical analytic - tightens Gibbs to deterministic conversion
Neyshabur norm (Neyshabur, Tomioka, and Srebro 2015) 2015 Norm-product WF\prod_\ell \Vert W_\ell \Vert_F small MLP / MNIST vacuous - first norm-based DNN bound; vacuous on real archs
Russo-Zou (Russo and Zou 2016) 2016 Information-MI 2σ2I(S;W)/n\sqrt{2 \sigma^2 I(S;W)/n} adaptive estimator analytic - initiated MI-based bounds for adaptive analysis
Xu-Raginsky (Xu and Raginsky 2017) 2017 Information-MI 2σ2I(S;W)/n\sqrt{2 \sigma^2 I(S;W)/n} Gibbs algorithm analytic - first general MI bound; estimator-free for Gibbs
Dziugaite-Roy (Dziugaite and Roy 2017) 2017 PAC-Bayes-data-dep SGD-optimized Gaussian posterior 3-layer MLP / MNIST T-600 0.161 - only fielded non-vacuous DNN bound; small-net only
Bartlett-Foster-Telgarsky (Bartlett, Foster, and Telgarsky 2017) 2017 Margin-spectral W2\prod_\ell \Vert W_\ell \Vert_2 / margin AlexNet / CIFAR-10 vacuous -0.08 Kendall τ\tau negative; counterintuitive sign
Zhang random-label (Zhang et al. 2017) 2017 Random-label-baseline train acc with random labels Inception / CIFAR-10 100% - not a bound but a falsification of capacity
Neyshabur PB-spectral (Neyshabur, Bhojanapalli, and Srebro 2018) 2018 PAC-Bayes-spectral perturbation + spectral product VGG / CIFAR-10 vacuous - depth-dependent KL blows up
Dziugaite data-dep (Dziugaite and Roy 2018) 2018 PAC-Bayes-data-dep data-split prior PSP_{S'} MNIST / CIFAR-10 0.46 - partial-data prior tightens KL; sample-splitting cost
Pensia-Langevin (Pensia, Jog, and Loh 2018) 2018 Information-MI I(S;W)I(S;W) under Langevin SDE Langevin SGD / synthetic analytic - closed-form MI for noisy GD
Lei algorithmic (Lei and Ying 2020) 2019 Information-MI algorithmic stability to MI SGD / convex losses analytic - bridges stability and MI bounds
Nagarajan-Kolter (Nagarajan and Kolter 2019) 2019 Impossibility two-sided UC lower bound linear / Gaussian mixture vacuous (proof) - rules out tight UC for SGD outputs
Bu-Zou-Veeravalli (Bu, Zou, and Veeravalli 2020) 2020 Information-per-sample (1/n)iI(Zi;W)(1/n) \sum_i I(Z_i; W) Gaussian mean estimator analytic - tighter than I(S;W)I(S;W); estimation hard in DNN
Haghifam CMI (Haghifam et al. 2020) 2020 Information-CMI conditional MI on supersample SGD / MNIST analytic - smaller than II; estimator still required
Negrea UC defense (Negrea, Dziugaite, and Roy 2020) 2020 PAC-Bayes-data-dep restricted-class UC linear models analytic - partial rebuttal to Nagarajan-Kolter
Esposito Renyi (Esposito, Gastpar, and Issa 2021) 2020 Information-MI Renyi-divergence bound Gibbs / theoretical analytic - extends MI to Renyi-α\alpha divergences
Jiang sharpness-magnitude (Jiang et al. 2020) 2020 Sharpness-empirical flat ratio ×\times magnitude \sim10k CNN / CIFAR-10 - 0.484 top-ranked predictor across the sweep
Jiang PB-sharpness (Jiang et al. 2020) 2020 Sharpness-empirical PAC-Bayes flat ratio at init \sim10k CNN / CIFAR-10 - 0.318 second-tier predictor
Jiang spectral-product (Jiang et al. 2020) 2020 Norm-product W2\prod_\ell \Vert W_\ell \Vert_2 \sim10k CNN / CIFAR-10 - -0.076 negatively correlated; theory-empirical mismatch
Perez-Ortiz tight (Perez-Ortiz et al. 2021) 2021 PAC-Bayes-data-dep data-split prior + Catoni-β\beta CIFAR-10 / ImageNet32 0.179 - best fielded PAC-Bayes on CIFAR-10
Alquier tutorial (Alquier 2024) 2024 PAC-Bayes-classical survey of all variants bibliographic - - reference state-of-the-field 2024

The table groups the literature into eight bound families, walked through in chronological order. Classical PAC-Bayes (McAllester, Seeger, Catoni, Lacasse) gave the foundational machinery (square-root form, binary-kl inversion, Legendre/Gibbs construction, C-bound derandomization) but was applied only to toy classifiers and Gaussian-process models; the bounds are tight asymptotically but were never fielded on a deep network because the KL between a deep-net random init and its SGD-trained weights is far too large for a useful absolute bound2627. The Dziugaite-Roy line broke this barrier by SGD-optimizing the bound itself on a 3-layer MLP28, reaching 0.161 on binary MNIST T-600; Perez-Ortiz tightened the recipe to 0.179 on full CIFAR-10 by combining a data-split prior (legal because PP does not see the training fold) with a tuned Catoni β\beta29.

The spectrally-normalized margin line (Bartlett-Foster-Telgarsky, Neyshabur PB-spectral) unifies VC-style class capacity with margin theory. The bound class is conceptually appealing because it depends only on weight-matrix norms and the margin (no QQ-design), but the spectral product W2\prod_\ell \|W_\ell\|_2 is empirically too large at production scale30; in the (Jiang et al. 2020) sweep this family scored Kendall τ0.08\tau \approx -0.08, the worst rank correlation among 40+ measures and one of the few negatively correlated with the test gap. Norm-product capacity, despite being the quantity the margin theorems control, is therefore a worse predictor of generalization than a coin flip.

The information-theoretic line (Russo-Zou, Xu-Raginsky, Bu-Zou-Veeravalli, Haghifam CMI, Pensia, Lei, Esposito) replaces hypothesis-class capacity with the mutual information I(S;W)I(S; W) between the training sample and the learned weights, and progressively tightens it: per-sample I(Zi;W)/nI(Z_i; W)/n is sharper than I(S;W)/nI(S; W)/n31, conditional MI on a supersample is sharper still32, and Langevin / SGD-noise dynamics give analytic closed forms. None of these has been fielded on a deep ConvNet: estimating I(S;W)I(S; W) in dimensions exceeding 10610^6 remains an open problem, and the bounds are useful as theoretical guarantees rather than numerical certificates.

The sharpness-empirical line (Jiang and co-authors) is conceptually a step back from formal bound-proving: it replaces "what is the absolute test gap?" with "which capacity-like measure best predicts the gap across a hyperparameter sweep?" This reframing is what produces the negative result for spectral-norm capacity and the positive result for sharpness-magnitude. Its weakness is flatness fragility: pointwise reparameterization can rescale sharpness arbitrarily without changing the predictor function (Dinh et al.)36, so any sharpness-based measure must be carefully normalized to be invariant under reparameterization; the field has not converged on a single such normalization.

The impossibility result and its partial rebuttals form the methodological coda. (Nagarajan and Kolter 2019) prove that on a linear-classifier mixture-of-Gaussians problem, no two-sided uniform-convergence bound applied to the SGD-output distribution can be tight: the bound is forced to be vacuous regardless of how the hypothesis class is parameterized37. (Negrea, Dziugaite, and Roy 2020) partially defends UC by restricting attention to a strict hypothesis-class subset that the algorithm visits with high probability; the result is that uniform convergence is salvageable but only via algorithm-dependent restrictions, which moves the framework closer to PAC-Bayes in spirit. Together with (Zhang et al. 2017)'s random-label baseline38, the impossibility line is the negative half of modern generalization theory: a definitive falsification of class-capacity-only programs and a forcing function toward algorithm-dependent analyses (PAC-Bayes, MI, sharpness).

The leaderboard separates four eras: foundational PAC-Bayes (1999-2007), the deep-learning challenge (2015-2017, when capacity-based bounds were shown to be vacuous on real architectures and the puzzle of Zhang et al. was sharpened), the bound-engineering era (2017-2021, when SGD-optimized PAC-Bayes and data-dependent priors finally produced fielded non-vacuous numbers on small models), and the modern audit-and-impossibility era (2019-2024, dominated by the Jiang sweep and the Nagarajan-Kolter obstruction). Three readings. First, the only fielded non-vacuous bounds on real classifiers are PAC-Bayes-data-dep variants and only on MLPs / small CNNs at most; every classical norm-product or margin bound exceeds 1 on AlexNet-scale and beyond. Second, the most predictive measures of the empirical generalization gap (by Kendall τ\tau) are sharpness-magnitude hybrids, not the norm-products that dominate the theory. Third, the impossibility result of Nagarajan-Kolter shows that the entire class-capacity-only program cannot in principle deliver tight bounds for SGD-trained classifiers; the path forward is algorithm-dependent, in line with PAC-Bayes and MI but in contrast with the spectral-margin tradition.

Neural Tangent Kernel and Infinite Width

Part III's papers sit at the intersection of three older traditions. The kernel-methods lineage (Schölkopf-Smola, Rasmussen-Williams) supplies the mathematical apparatus (RKHS, positive-definite kernels, kernel ridge regression) that the NTK slots into. The Bayesian-neural-network lineage (MacKay 1992, Neal 1996, Graves 2011, Blundell et al. 2015) supplies the NNGP prior and the variational-inference toolkit that extends to finite-width approximations. The statistical-physics-of-learning lineage (Sompolinsky-Crisanti-Sommers 1988, Seung-Sompolinsky-Tishby 1992, Saad-Solla 1995) supplies the mean-field/replica toolkit that underpins (Mei, Montanari, and Nguyen 2018) and the effective theory of (Roberts, Yaida, and Hanin 2022).

Within Part III's own papers, the intellectual dependency graph runs as follows. (Neal 1996) sits at the root; it establishes the infinite-width-as-GP principle for single-layer networks. (Lee et al. 2018) and (Matthews et al. 2018) extend this to deep networks, giving the NNGP recursion (eq:nngp-recursion). (Jacot, Gabriel, and Hongler 2018) then introduces the tangent-kernel counterpart that governs dynamics rather than priors, and (Lee et al. 2019) tightens the convergence to a uniform-linearization theorem. (Arora et al. 2019) operationalizes all of this by computing the NTK exactly at CIFAR-10 scale. The critique then emerges in three papers: (Chizat, Oyallon, and Bach 2019) diagnoses the NTK regime as lazy via the α\alpha-scaling and demonstrates that feature-learning (rich) regimes exist; (Mei, Montanari, and Nguyen 2018) gives the mean-field PDE as an explicit feature-learning limit; (Yang and Hu 2021) unifies these under the abc-parameterization and identifies μ\muP as the canonical rich-regime limit. Finally, (Hanin and Nica 2019) and (Roberts, Yaida, and Hanin 2022) quantify finite-width corrections as a 1/n1/n-expansion, and (Geiger et al. 2020) and (Woodworth et al. 2020) provide empirical and mathematical evidence that real networks are not purely in the NTK regime.

Part III's place in the statistical-DL story

Part III sits in a specific position relative to the other parts of the book. Part I established classical capacity-based generalization theory and its failure on over-parameterized networks. Part II introduced PAC-Bayes as a more refined framework that handles over-parameterization via posterior-averaging. Part III introduces the NTK as a third distinct path: not a generalization bound but a generalization model in which training dynamics are explicit and closed-form.

The three approaches are complementary. Classical bounds (Part I) are architecture-agnostic but loose; PAC-Bayes bounds (Part II) are architecture-aware and tight with good priors; NTK analysis (Part III) is architecture-specific and gives exact predictions for wide-network training. A complete theory of deep-network generalization would combine these: PAC-Bayes analysis with NTK-GP priors is a natural synthesis that has been pursued (Neyshabur, Bhojanapalli, and Srebro 2018) with some success.

Parts IV and V extend the narrative. Part IV examines benign overfitting and double descent, phenomena that the NTK can describe in its linearized form but that require finite-width corrections to fully explain. Part V examines scaling laws, for which the NTK provides a mathematical template (eq:kernel-scaling-law) but not a complete mechanism.

Collectively the five parts frame a research program: statistical-learning-theoretic explanation of deep learning proceeds by (i) ruling out classical bounds as sufficient (Parts I, II), (ii) establishing the kernel-regression baseline (Part III), (iii) characterizing the gap between this baseline and practice (Parts III, IV, V). No single part delivers a complete theory; together they outline the shape one would take.

Intellectual lineages beyond Part III

A useful orienting frame is to place Part III's papers in the broader intellectual genealogy of the theory of neural networks.

Kernel methods (Aronszajn 1950; (Schölkopf and Smola 2002; Rasmussen and Williams 2006)). The RKHS framework that the NTK slots into was developed in mid-twentieth-century analysis for reasons having nothing to do with neural networks. The NTK's contribution is to identify a specific family of kernels (those arising from neural-architecture parameter-gradients) that inherit the architectural inductive bias and are computationally tractable.

Statistical mechanics of learning (Sompolinsky-Crisanti-Sommers 1988; Seung-Sompolinsky-Tishby 1992; Saad-Solla 1995). The physics tradition of analyzing learning as a statistical-mechanical system, with temperature corresponding to training noise and magnetization corresponding to generalization, underlies the mean-field analysis of (Mei, Montanari, and Nguyen 2018) and the effective-theory program of (Roberts, Yaida, and Hanin 2022). The Wasserstein-gradient-flow reformulation of (eq:mean-field-pde) is more recent and owes to optimal-transport theory (Ambrosio, Gigli, and Savare 2005).

Bayesian neural networks (MacKay 1992; Neal 1996; Graves 2011). The NNGP sits at the infinite-width end of this lineage. Finite-width Bayesian neural networks remain an active research area with variational and Monte Carlo methods; the NNGP provides an analytic benchmark against which finite-width approximations can be measured.

High-dimensional probability and random-matrix theory (Wigner 1955; Marchenko-Pastur 1967; recent kernel-random-matrix literature). The analysis of kernel Gram matrices, their spectra, and their double-descent behavior leans on random-matrix theory. The NTK Gram matrix at a random training set has spectral properties determined by random-matrix ensembles, which connects to the benign-overfitting / double-descent analyses of Part IV.

Tangent methods in optimization (Amari 1998 natural gradient; Martens-Grosse 2015 K-FAC). The NTK is mathematically a specific empirical-Fisher-like object; natural-gradient methods that precondition by the Fisher information are close relatives. The theoretical insight of the NTK is that the linear-ODE structure emerges from the infinite-width limit, not the preconditioning itself.

Gradient-flow analysis of neural networks (Du-Lee-Li-Wang-Zhai 2019; Chizat-Bach 2019). A parallel line of work analyzes gradient-flow convergence on specific neural architectures (two-layer ReLU, deep linear networks) without necessarily taking width to infinity. The NTK provides the infinite-width limit point of this line; the finite-width analyses give convergence rates for specific architectures that the NTK subsumes asymptotically.

Cross-references into other parts

Cross-paper SoTA positioning

How do the papers of Part III position against each other in the SoTA landscape? A few observations:

(Jacot, Gabriel, and Hongler 2018) established the theoretical framework but did not push benchmark numbers. Its empirical section was limited to small synthetic experiments confirming kernel concentration; it did not claim that NTK regression was a competitive predictor.

(Lee et al. 2018) pushed the NNGP benchmark numbers first. It reported NNGP on MNIST matching or exceeding finite networks, which was the first public evidence that an infinite-width closed-form kernel could be competitive on standard benchmarks.

(Arora et al. 2019) pushed the NTK benchmark numbers to CIFAR-10. It delivered the CNTK vs CNN table we have referenced, and established the 5-7% gap as the empirical anchor for subsequent discussions.

(Lee et al. 2019) closed the theoretical loop by proving uniform linearization, which justifies treating infinite-width SGD-trained networks as exactly equivalent to kernel regression.

(Chizat, Oyallon, and Bach 2019) introduced the scaling-based diagnosis of the NTK regime, revealing that lazy training is a universal phenomenon rather than a neural-network-specific one.

(Mei, Montanari, and Nguyen 2018) and (Yang and Hu 2021) offered feature-learning alternatives that explain phenomena the NTK cannot. (Yang and Hu 2021) is particularly influential in practice because μ\muP-transfer is directly usable.

(Hanin and Nica 2019) and (Roberts, Yaida, and Hanin 2022) quantified finite-width corrections, providing a bridge between the idealized infinite-width theory and finite-network reality.

(Geiger et al. 2020) and (Woodworth et al. 2020) provided empirical and mathematical evidence that real-network regimes lie strictly between the lazy and rich extremes.

The collective arc is from idealization (NTK as exact theory of wide networks) through critique (empirical gap is real) to synthesis (finite-width corrections and feature-learning parameterizations together give a richer picture). Each paper in Part III plays a specific role in this arc, and none is self-sufficient.

Connecting to scaling laws (Part V preview)

Part V treats neural scaling laws as a distinct topic. A brief preview of the NTK connection is useful here. The observation that test error scales as a power law in compute, data, and parameters (Kaplan et al. 2020; Hoffmann et al. 2022) begs the theoretical question of why power laws. The NTK provides a partial answer (Maloney, Roberts, and Sully 2022): for kernel regression with a kernel whose eigenvalues decay as λkkα\lambda_k \sim k^{-\alpha} (power law with decay rate α\alpha), the test error on a target with coefficients βkkβ\beta_k \sim k^{-\beta} scales as

𝔼[testerror]ntrainmin(α,2β1)/α+regularizationterms,\begin{equation} \mathbb{E}[\mathrm{test\ error}] \;\sim\; n_{\mathrm{train}}^{-\min(\alpha, 2\beta - 1)/\alpha} + \mathrm{regularization\ terms}, \end{equation}

which is a power law in training-set size. The exponent depends on the kernel's spectral decay (architecture-dependent) and the target's smoothness (task-dependent). For natural datasets whose eigenvalue spectra can be estimated, this predicts scaling exponents that match observed ones reasonably well.

The limitations of this NTK-based scaling story are: it assumes the network is in the NTK regime (not feature-learning); it predicts a single power-law exponent per (architecture, task) pair, whereas observed scaling laws are often better fit by broken power laws with regime transitions; it has no mechanism for the emergent abilities reported in large language models. So the NTK gives a mathematical template for scaling-law derivation but does not account for the full empirical picture.

SoTA table across NTK vs modern architectures

Leaderboard-style comparison: NTK-regression and modern architectures on CIFAR-10 and MNIST. Numbers compiled from (Arora et al. 2019) (CNTK entries), (Lee et al. 2018) (NNGP entries), and standard benchmark repositories. The NTK/NNGP entries represent infinite-width kernel regression with no training and no data augmentation. The WideResNet entry uses Cutout augmentation; the ViT entry uses ImageNet-21k pretraining. The point of the table is the magnitude of the gap: modern practice exceeds NTK by 19-23 points on CIFAR-10, and no finite architecture is within 10 points of the NTK ceiling.
Method CIFAR-10 (%) MNIST (%) Notes
NNGP (FC, depth 10) (Lee et al. 2018) \approx 54.0 \approx 98.7 no training; prior-predictive
CNTK-V (depth 4) (Arora et al. 2019) 66.1 - no pooling, no training
CNTK-GAP (depth 4) (Arora et al. 2019) 75.5 - global-avg pool; infinite-width
CNTK-GAP (depth 21) (Arora et al. 2019) 77.4 - ceiling across depths
CNN-V (depth 11) (Arora et al. 2019) 68.4 - same architecture, SGD-trained
CNN-GAP (depth 21) (Arora et al. 2019) 83.0 - same architecture, SGD-trained
ResNet-34 (Arora et al. 2019) 96.0 > 99.7 skip connections + BN
WideResNet-28-10 (Cutout) 97.1 > 99.7 +augmentation; post-2018
ViT-L/16 (IN21k pretrain) 99.4 > 99.9 +pretraining; post-2020

The generalization of this table is: infinite-width kernel methods are competitive with non-pretrained, non-augmented, unregularized finite networks of matching architecture, but fall far short of modern training recipes that exploit feature learning, data augmentation, and transfer learning. The NTK theory explains one regime (wide, lazily trained, from scratch) and does not explain the regime where modern accuracy lives.

Cross-architecture comparison of the gap

Cross-architecture comparison of NTK-regression vs finite-width SGD-trained accuracy gap on CIFAR-10. Rows are architecture families. The "NTK accuracy" column is the best reported infinite-width kernel-regression result; the "SGD accuracy" column is the best reported finite-width trained result on the same architecture and training-data split (no augmentation, no transfer). Numbers are approximate and drawn from (Arora et al. 2019; Lee et al. 2018, 2020) and successor benchmarking efforts.
Architecture NTK accuracy SGD accuracy Gap
FC depth-3 54.0 55.0 1.0
FC depth-10 54.5 56.0 1.5
CNN-V depth-11 65.9 68.4 2.5
CNN-GAP depth-6 77.4 78.5 1.1
CNN-GAP depth-21 77.4 83.0 5.6
ResNet-like depth-20 \approx79 \approx87 \approx8
Transformer (vision) \approx73 \approx85 \approx12

The pattern is clear: gap grows with architectural sophistication. Fully-connected networks have essentially no gap; basic CNNs have small gaps; modern architectures (ResNets, ViTs) have large gaps. This is the empirical shape of feature learning's contribution.

The persistence of the gap across training recipes

A natural question is whether the NTK-to-SGD gap can be closed by adjusting training recipe. Several attempts have been made, with mixed results:

Training for longer: extending SGD training does not close the gap; SGD converges to a feature-learning predictor, and continued training does not revert to kernel regression.

Using larger widths: increasing nn reduces the kernel fluctuation gap but does not reduce the feature-learning gap; infinite-width μ\muP networks still have feature-learning benefits that NTK lacks.

Using regularization: explicit regularization (weight decay, dropout) can improve generalization but does not close the gap; both NTK and SGD-trained networks benefit, and the SGD-trained network still wins.

Using better initialization: initialization schemes that keep networks closer to the NTK regime (smaller variance, or effective α\alpha) move the SGD-trained network toward the NTK predictor but at the cost of worse final performance.

The persistence of the gap suggests it is structural: SGD-trained finite networks operate in a feature-learning regime that is inherently better than kernel regression for sufficiently complex tasks, and no adjustment of training recipe can move them into the pure kernel regime without sacrificing the performance benefit. The gap is a feature of practical training, not a bug in the NTK approximation.

Gap decomposition by architecture family

Different architecture families exhibit different NTK-to-SGD gaps. A rough taxonomy:

Fully-connected networks. The NTK gap is small for FC networks at moderate width. On MNIST with a 3-layer FC network, the infinite-width NNGP reaches 98.7%, the NTK-regression predictor reaches 99.0%, and an SGD-trained finite network reaches 99.1%. The gap is within sampling noise. The NTK theory is a quantitative theory of wide fully-connected networks.

Vanilla CNNs (no skip connections, no batch norm). The NTK gap is intermediate. On CIFAR-10, the gap between CNTK-GAP and the matching CNN-GAP is 5-7% in the depth range 11-21 (Table above). This is where the feature-learning story becomes essential.

Modern CNNs (ResNets, WideResNets). The NTK gap is large. Analogs of CNTK for ResNets have been proposed (Huang and Yau 2020) but do not capture the training-recipe benefits (augmentation, label smoothing, mixup, schedules) that contribute substantially to modern accuracy. A precise NTK-vs-SGD comparison on ResNets is not easy to set up fairly, but order-of-magnitude estimates place the gap at 10-15% on CIFAR-10.

Transformers. The NTK gap is unclear. Attention-NTKs (Hron et al. 2020) are computable but have not been benchmarked extensively against SGD-trained Transformers; the order of limits matters (dkd_k \to \infty before HH \to \infty, or the kernel degenerates), and the resulting NTK has no closed form because the softmax requires a double Gaussian integral evaluated by Hermite quadrature. On a single-layer infinite-attention experiment with IMDB sentiment, NNGP achieves 90.1%90.1\%, NTK 90.4%90.4\%, and finite-width attention 89.8%89.8\% (Hron et al. 2020), so for that controlled setting infinite-width predictions match finite networks within 1%. Skip connections are essential: without residuals the attention NNGP collapses to a constant as LL \to \infty. Layer-norm reduces in the infinite-width limit to a covariance normalization ΣΣ/tr(Σ)\Sigma \mapsto \Sigma/\mathrm{tr}(\Sigma). At larger scale, preliminary experiments suggest gaps comparable to modern CNNs (10-15% on language-modeling metrics), though the comparison is complicated by the pretraining-and-fine-tuning structure of Transformer workflows.

Graph networks. Graph NTKs (Du et al. 2019) are computable for message-passing architectures and on graph-classification benchmarks the kernel often matches or exceeds SGD-trained GNNs: on MUTAG, GNTK reaches 90.0%90.0\% vs 89.4%89.4\% for the best GIN; on COLLAB, 83.6%83.6\% vs 80.2%80.2\%, and similar inversions hold for 7 of 9 standard benchmarks (PROTEINS, NCI1, IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY, REDDIT-5K). The kernel decomposes into node-level kernels Θu,u(G,G)\Theta_{u,u'}(G, G') plus sum-pooling, with the number of message-passing rounds kk entering as a hyperparameter (sweet spot k=2k = 2-33). The cheap, monotone, kernel-favourable picture for graphs may reflect the relatively shallow (2-4 layer) depth of typical GNN architectures, which keeps them in a more NTK-favorable regime; it also suggests feature learning is less critical for graph-classification than for vision, an inversion of the standard NTK-versus-feature-learning narrative.

The overall pattern is that simpler, shallower, more homogeneous architectures are better-described by the NTK, and more sophisticated architectures exhibit larger gaps. This is consistent with the diagnosis that the NTK-vs-SGD gap tracks the degree to which training benefits from feature learning, which in turn tracks architectural sophistication.

NTK's contribution to the theory of generalization

The NTK's principal contribution to generalization theory is to provide a computable baseline against which to measure. Before the NTK, generalization analyses of deep networks were either very loose (VC dimension, Rademacher complexity) or required strong unverified assumptions (PAC-Bayes with specific posterior priors). The NTK changed the landscape by delivering:

  1. An explicit upper bound on capacity: the NTK's RKHS norm ball of radius RR has Rademacher complexity 𝒪(Rtr(Θ(X,X))/ntrain)\mathcal{O}(R \cdot \sqrt{\mathrm{tr}(\Theta_\infty(X,X))} / n_{\mathrm{train}}), which is computable and non-vacuous for wide networks.
  2. A direct generalization bound: for functions in the NTK RKHS with bounded norm, test error is bounded by training error plus the Rademacher term, giving non-vacuous bounds of the form "test error \le training error + 𝒪(R/ntrain)\mathcal{O}(R / \sqrt{n_{\mathrm{train}}})".
  3. A spectral explanation: components of yy^\star aligned with large eigenvalues of Θ(X,X)\Theta_\infty(X,X) are fit easily and generalize well; components in small eigenvalues generalize poorly. This makes spectral decay a quantitative predictor of generalization.

For networks demonstrably in the NTK regime (very wide, NTK-parameterized, trained without aggressive recipes), these bounds are tight: predicted generalization matches observed generalization within small factors. For networks outside the NTK regime, the bounds are conservative: they upper-bound the kernel-regression generalization, which is worse than the feature-learning generalization of the actually-trained network.

This is the pattern that justifies the framing "NTK as baseline": wherever the NTK bound is tight, deep learning is kernel learning; wherever the NTK bound is loose, the slack is attributable to feature learning. The NTK is therefore a tool for diagnosing when feature learning matters, by comparison of NTK-predicted-test-error to actual test error.

Critiques and counter-evidence

Several authors have pushed back on the NTK-as-theory-of-deep-learning framing with evidence that finite networks behave in ways the NTK cannot predict.

Shift-invariance and data augmentation. (Li et al. 2019) showed that data augmentation is not equivalent to any kernel, and hence the benefit of augmentation is invisible to NTK regression. Yet augmentation adds 2-5% test accuracy on CIFAR-10 in practical training. This rules out any argument that the NTK plus appropriate data processing recovers CNN accuracy; some benefit of training must come from outside the kernel-regression framework.

Catastrophic forgetting and transfer learning. A trained network can be fine-tuned on a new task with transfer of learned features. The NTK has no mechanism for this: the kernel is fixed at initialization, so fine-tuning a kernel-regression predictor on a new task is just re-running kernel regression with the same kernel, which cannot leverage prior training. The empirical fact of successful transfer learning (BERT, GPT, CLIP, etc.) is therefore evidence against the NTK as a complete theory of deep-network training.

Adversarial robustness. (Tsipras et al. 2019) and subsequent work show that adversarial-robustly trained networks have different feature representations from naturally-trained ones, with robust features often being human-interpretable. This feature-selection process is a rich-regime phenomenon; adversarial robustness cannot emerge from kernel regression on a fixed kernel.

Emergent abilities. The observation that large language models exhibit qualitatively new capabilities (e.g., in-context learning, chain-of-thought reasoning) at threshold scales has no NTK explanation. The NTK predicts smooth, monotone, power-law-governed performance scaling, not phase-transition-like emergence. Even granting that emergence may be partly a measurement artifact (Schaeffer, Miranda, and Koyejo 2023), the NTK-predicted behavior is insufficient to rationalize large-model practice.

Grokking. (Power et al. 2022) report that networks trained long past fitting the training data eventually suddenly generalize, with test accuracy rising from chance to 100% over a narrow training window. This is incompatible with the NTK's exponential-decay training dynamics (eq:ntk-residual-flow), which predicts that test accuracy should track training loss monotonically.

Taken together, these phenomena establish that something beyond kernel regression is happening in practical deep-learning training. The NTK's residual value is as a null model: behaviors the NTK predicts are explicable without appeal to feature learning; behaviors it does not are evidence for feature-learning mechanisms.

The role of scale in deep-learning theory

A recurring theme in Part III is that scale is not merely a practical scaling axis but a conceptual one: different scales correspond to different theoretical regimes. The NTK emerges as width goes to infinity; the mean-field regime as width goes to infinity with a different parameterization; double descent as training-set size approaches the critical threshold; emergent abilities as compute or parameters cross task-specific thresholds.

The theoretical challenge is that each regime is hard on its own terms, and transitions between regimes are hard to characterize. The NTK regime is tractable but narrow; the mean-field regime is tractable for simple architectures; the μ\muP regime is tractable via Tensor Programs but computationally heavy. None of them overlap with the practical regime of deep-learning training, which involves finite but large networks with complex architectures and training recipes.

A concrete manifestation: the regime analysis of Chapter 3 says that practical networks with α\alpha in an intermediate range are in an interpolating regime where neither the lazy nor the rich theory directly applies. The description of this interpolating regime is an active research direction. Candidate approaches include:

The conceptual shift Part III embodies is from "deep learning is a tricky optimization problem" to "deep learning is a multi-regime phenomenon whose regimes have different governing equations". This shift changes what a "theory of deep learning" should look like: not a single theorem, but a family of theorems for different regimes plus a characterization of when each applies.

Benign Overfitting and Double Descent

The "benign overfitting" research programme crystallized between 2017 and 2020 around three stubborn empirical facts that uniform-convergence theory could not accommodate. First, deep networks fit random labels to zero training error while still generalizing on the true labels (Zhang et al. 2017) (and even when they do generalise on natural inputs they remain catastrophically fragile to imperceptible adversarial perturbations (Szegedy et al. 2014); the two facts together say that the inductive bias selecting a generalising interpolator is also selecting a non-robust one, which any capacity-only account cannot adjudicate). Second, their test error does not increase monotonically with capacity past the interpolation threshold; it decreases again, tracing the double-descent curve (Belkin et al. 2019; Nakkiran et al. 2020). Third, the same phenomenology is visible in linear regression, kernel regression, and random-feature regression, which admit exact asymptotic analysis (Bartlett et al. 2020; Mei and Montanari 2022; Hastie et al. 2022). These three facts together ruled out any explanation based purely on function-class capacity and forced the field to look at the interaction between the data covariance spectrum, the optimizer's implicit bias, and the architectural parameterization.

The lineage of the interpolation idea predates deep learning. (Schapire et al. 1998) already explained why boosting keeps improving test error after training error hits zero (the margin story, see Part I). The kernel community observed the same behaviour when interpolating RBF kernels with zero regularization, documented systematically by (Belkin, Ma, and Mandal 2018): kernels with sufficiently flat spectra can interpolate noisy data and still generalize, contradicting the bias-variance orthodoxy that interpolation must overfit. What deep learning added was scale: models routinely exceed sample size by orders of magnitude, pushing into a regime where the classical analyses simply do not apply. (Belkin et al. 2019) was the synthesis paper that named the phenomenon ("double descent") and drew the canonical figure; a U-curve joined to a second descent, with a spike at the interpolation threshold.

The theoretical settlement came from two directions. (Bartlett et al. 2020) gave finite-sample risk bounds for the min-norm linear interpolator and identified the effective-rank conditions that govern when interpolation is benign; (Tsigler and Bartlett 2023) extended this to ridge regression with both positive and negative regularization. Independently, (Liang and Rakhlin 2020) proved a matching statement for kernel ridgeless regression in the high-dimensional setting, relating the implicit regularization to the curvature of the kernel. On the optimization side, (Soudry et al. 2018) showed that gradient descent on separable data with logistic-type losses converges to the max-margin (L2 SVM) predictor; the implicit bias result that provided the missing algorithmic explanation. (Gunasekar et al. 2018) extended this to convolutional networks (different implicit bias: Fourier-sparse, not L2), and (Ji and Telgarsky 2019) closed the nonseparable case.

The final piece was the random-features bridge. (Rahimi and Recht 2008) introduced random features as a tractable kernel approximation; (Mei and Montanari 2022) used the random-features model as an analytically solvable proxy for a two-layer neural network and derived the precise asymptotic curves that match Belkin and Nakkiran's empirical double-descent plots. (Hastie et al. 2022) independently analyzed high-dimensional ridgeless least squares and also reproduced double descent. (Advani, Saxe, and Sompolinsky 2020) had earlier given the dynamical picture from statistical physics, explaining double descent as an interaction between signal and noise modes during training.

A parallel empirical thread sharpens what role overparameterization actually plays. The Lottery Ticket Hypothesis (Frankle and Carbin 2019) (ICLR 2019 Best Paper) demonstrated that dense networks contain sparse subnetworks ("winning tickets") that, when extracted and reset to their original random initialization and retrained in isolation, match the dense network's accuracy at roughly 10-20% of the original weight count. The construction is iterative magnitude pruning (IMP): train the full network to convergence, prune the lowest-magnitude pp% of weights, reset surviving weights to their exact values from before any training (not to zero, not to a new random draw; back to θ0\theta_0), then retrain the sparse network. Repeating this over multiple rounds finds progressively sparser winning tickets.

On small networks the result is striking. A Lenet winning ticket at 21.1% sparsity reaches minimum validation loss 38% faster than the full network and finishes 0.3 percentage points higher. On CIFAR-10 convolutional networks, the winning ticket at ~11% sparsity improves accuracy by 3.5 points over the dense baseline when matched for iteration count.

The core claim is not about expressivity but about initialization quality: the dense network's parameter budget functions as a search space for discovering an initialization that makes sparse training efficient. The winning ticket is not sparser-but-equivalent; it trains faster than the full network, which a capacity argument cannot explain.

The scale limitation of the original paper is concrete: IMP fails on VGG-19 and ResNet-18 at standard learning rates (0.1), because large-network training is sensitive to learning-rate warmup in ways small-network training is not. Subsequent work (frankle_linear_2020?) addressed this by rewinding surviving weights not all the way to θ0\theta_0 but to a checkpoint taken after a few hundred steps (the "stability point"), recovering the winning-ticket property at ImageNet scale. This refinement is not in the 2019 paper. The skeptical reading that connects Lottery Ticket to double descent: double descent says capacity past interpolation helps generalization; Lottery Ticket says the mechanism is finding a sparse trainable initialization, not deploying all the parameters at inference. This unified reading is not yet a theorem, but it is the cleanest framing for why both phenomena coexist without contradicting each other.

SoTA leaderboard for the interpolation regime across modern benchmarks. All rows are in the over-parameterized regime (parameters much greater than training examples); the point of the table is that train error is near zero while test error continues to improve, empirically refuting the naive bias-variance bound. Numbers are representative headline values reported in the cited papers and follow-ups.
Model family Setting Train acc / err Test acc / err Parameters
Inception, CIFAR-10 true labels (Zhang et al. 2017) SGD to convergence, over-parameterized 100.0 % 89.1 % test acc \approx1.6 M
Inception, CIFAR-10 random labels (Zhang et al. 2017) Same recipe 100.0 % 10 % (chance) \approx1.6 M
AlexNet, ImageNet true labels (Zhang et al. 2017) Standard \approx100 % top-5 \approx83 % top-5 \approx60 M
ResNet-18, CIFAR-10 15% noise (Nakkiran et al. 2020) Width-parameter 64, epoch 4000 100 % \approx 6.5 % test error \approx11 M
ResNet-18, CIFAR-10 15% noise (Nakkiran et al. 2020) Width-parameter 10, at EMC threshold 100 % \approx18 % test error (peak) \approx0.3 M
Standard Transformer, WMT'14 en-fr (Nakkiran et al. 2020) Epoch-wise double descent, 18 layers near 0 train loss +0.9 BLEU over classical base
Min-norm linear interpolator (Bartlett et al. 2020) Gaussian features, p/nγp/n \to \gamma 0 train err benign iff eff-rank cond pnp \gg n
RF regression, two-layer ReLU (Mei and Montanari 2022) n=N=dn = N = d, ψ1=ψ2=1\psi_1 = \psi_2 = 1 0 train err double-descent peak N=d=nN=d=n
Laplace kernel, MNIST/CIFAR-2 (Belkin, Ma, and Mandal 2018) Ridgeless, zero training error 0 train err within 1 % of regularized kernel $n× n$

Scaling Laws and Information-Theoretic Views

The papers collected in this part span two otherwise disjoint literatures: empirical scaling (Kaplan, Chinchilla, Hestness, Bahri) and information-theoretic dynamics (Tishby, Shwartz-Ziv, Saxe), connected by the question of whether deep networks' generalization curves admit a compact functional description. A third, more recent strand (Wei, Schaeffer, Power) asks whether the smooth power laws tell the whole story or whether discontinuous behaviours (emergence, grokking) live on top of them. A fourth strand reframes the bottleneck from quantity to quality. (Sorscher et al. 2022) (NeurIPS 2022 Outstanding) argue that the Hestness-Kaplan power law LDαDL \propto D^{-\alpha_D} is not a property of the learning problem but an artefact of i.i.d. random subsetting. When training examples are ranked by difficulty and the easiest fraction is discarded, test error decays exponentially in the retained fraction ff in the over-parameterised regime rather than following the 1/D\propto 1/\sqrt{D} rate that classical uniform-convergence theory predicts. The intuition: random subsets always retain both easy and redundant samples; a pruned subset concentrated on hard, near-decision-boundary examples provides more gradient signal per sample, so the model learns the same boundary geometry from a smaller dataset.

Ten difficulty metrics are benchmarked, including EL2N (early-training 2\ell_2 error norm), forgetting scores, memorization, ensemble uncertainty, and margin. Among supervised metrics, EL2N performs best on CIFAR-10: a ResNet-18 trained on the top-50% hardest examples (by EL2N) matches full-dataset accuracy. On ImageNet, training on 80% of examples selected by memorization score matches the 100% baseline; most supervised metrics degrade sharply below 60% retention. A new self-supervised metric is proposed: cosine distance to the nearest k-means cluster centroid in SWaV embeddings, which matches supervised baselines without requiring labels and enables pruning before annotation.

The exponential advantage is conditional. It holds only in the over-parameterised regime (model large enough to fit the retained set) and only when the pruning metric aligns well with the true difficulty ordering (measured by angular alignment θ\theta between the probe gradient and the teacher gradient). When the probe misaligns by more than roughly 10 degrees, a minimum retained fraction fmin(θ)f_{\min}(\theta) exists below which further pruning does not help; the exponential phase ends and the standard power law resumes. The paper therefore does not supersede scaling laws for the frontier regime (where models are maximally over-parameterised and the full pre-training corpus is already curated); it applies most directly to fine-tuning and mid-scale training where aggressive dataset curation is practical.

The shared methodology is to fit a parametric family to train-and-test curves and to ask what the fit exponents mean. In Kaplan's setup, the test cross-entropy loss is modelled as three single-variable power laws plus an irreducible offset; in Chinchilla it is a single joint NN-DD surface; in Hestness it is a dataset-size-only power law; in Bahri the exponent is derived from kernel spectra and manifold dimension. The SoTA landscape across papers is summarised below by the exponents appearing in the three canonical laws

L(N)=(NcN)αN,L(D)=(DcD)αD,L(C)=(CcC)αC.\begin{equation} L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}. \end{equation}
SoTA scaling-law exponents across the papers in this part. Empty cells mark laws not reported by that paper. Kaplan's αN\alpha_N and αD\alpha_D are for transformer LMs trained on WebText2 with non-embedding parameters NN; Chinchilla's exponents are symmetric because the compute-optimal recipe equalises the contributions of NN and DD.
Paper αN\alpha_N αD\alpha_D αC\alpha_C Notes
Kaplan 2020 (Kaplan et al. 2020) 0.076 0.095 0.050 GPT-style LM, WebText2, cross-entropy nats/token
Chinchilla (Hoffmann et al. 2022) 0.34 0.28 - Joint (N,D)(N,D) fit, symmetric exponents
Hestness et al. (Hestness et al. 2017) - 0.07-0.35 - Varies by domain: NMT, LM, vision
Bahri et al. (Bahri et al. 2024) derived derived - α4/d\alpha \sim 4/d or αK1\alpha_K - 1
Sharma & Kaplan (Sharma and Kaplan 2022) derived derived - αD=4/dmanifold\alpha_D = 4/d_{\mathrm{manifold}}
Maloney et al. (Maloney, Roberts, and Sully 2022) analytical analytical analytical Closed form from random-feature model
Michaud et al. (Michaud et al. 2023) derived derived - α=(β1)/β\alpha = (\beta-1)/\beta from Zipf exponent

The IB and emergence rows do not fit the αN\alpha_N/αD\alpha_D/αC\alpha_C schema because those papers report different observables (mutual information trajectories, task-by-task accuracy jumps). They appear later in this chapter under their own headings.

References

Advani, Madhu S., Andrew M. Saxe, and Haim Sompolinsky. 2020. "High-Dimensional Dynamics of Generalization Error in Neural Networks." Neural Networks 132: 428-46.
Alquier, Pierre. 2024. "User-Friendly Introduction to PAC-Bayes Bounds." Foundations and Trends in Machine Learning.
Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savare. 2005. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkhauser.
Anthony, Martin, and Peter L. Bartlett. 1999. Neural Network Learning: Theoretical Foundations. Cambridge University Press.
Arora, Sanjeev, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. 2019. "On Exact Computation with an Infinitely Wide Neural Net." In Advances in Neural Information Processing Systems (NeurIPS).
Bahri, Yasaman, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. 2024. "Explaining Neural Scaling Laws." Proc. National Academy of Sciences.
Bartlett, Peter L., Dylan J. Foster, and Matus J. Telgarsky. 2017. "Spectrally-Normalized Margin Bounds for Neural Networks." In Advances in Neural Information Processing Systems (NeurIPS).
Bartlett, Peter L., Philip M. Long, Gábor Lugosi, and Alexander Tsigler. 2020. "Benign Overfitting in Linear Regression." Proc. National Academy of Sciences 117 (48): 30063-70.
Bartlett, Peter L., and Shahar Mendelson. 2002. "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results." Journal of Machine Learning Research 3: 463-82.
Baum, Eric B., and David Haussler. 1989. "What Size Net Gives Valid Generalization?" Neural Computation 1 (1): 151-60.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off." Proc. National Academy of Sciences 116 (32): 15849-54.
Belkin, Mikhail, Siyuan Ma, and Soumik Mandal. 2018. "To Understand Deep Learning We Need to Understand Kernel Learning." In Proc. International Conference on Machine Learning (ICML).
Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. 1992. "A Training Algorithm for Optimal Margin Classifiers." In Proc. 5th Annual Workshop on Computational Learning Theory (COLT), 144-52.
Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
Bu, Yuheng, Shaofeng Zou, and Venugopal V. Veeravalli. 2020. "Tightening Mutual Information-Based Bounds on Generalization Error." IEEE Journal on Selected Areas in Information Theory 1 (1): 121-30.
Bubeck, Sébastien, and Mark Sellke. 2021. "A Universal Law of Robustness via Isoperimetry." In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2105.12806.
Catoni, Olivier. 2007. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics.
Chizat, Lenaïc, Edouard Oyallon, and Francis Bach. 2019. "On Lazy Training in Differentiable Programming." In Advances in Neural Information Processing Systems (NeurIPS).
Cortes, Corinna, and Vladimir Vapnik. 1995. "Support-Vector Networks." Machine Learning 20 (3): 273-97.
Du, Simon S., Kangcheng Hou, Ruslan Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu Xu. 2019. "Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels." In NeurIPS.
Dziugaite, Gintare Karolina, and Daniel M. Roy. 2017. "Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters Than Training Data." In Proc. Uncertainty in Artificial Intelligence (UAI).
- - - . 2018. "Data-Dependent PAC-Bayes Priors via Differential Privacy." In NeurIPS.
Esposito, Amedeo Roberto, Michael Gastpar, and Ibrahim Issa. 2021. "Generalization Error Bounds via Renyi, f-Divergences and Maximal Leakage." IEEE Transactions on Information Theory 67 (8): 4986-5004.
Frankle, Jonathan, and Michael Carbin. 2019. "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1803.03635.
Freund, Yoav, and Robert E. Schapire. 1997. "A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting." Journal of Computer and System Sciences 55 (1): 119-39.
Geiger, Mario, Stefano Spigler, Arthur Jacot, and Matthieu Wyart. 2020. "Disentangling Feature and Lazy Training in Deep Neural Networks." In Journal of Statistical Mechanics.
Gunasekar, Suriya, Jason D. Lee, Daniel Soudry, and Nathan Srebro. 2018. "Implicit Bias of Gradient Descent on Linear Convolutional Networks." In Advances in Neural Information Processing Systems (NeurIPS).
Haghifam, Mahdi, Jeffrey Negrea, Ashish Khisti, Daniel M. Roy, and Gintare Karolina Dziugaite. 2020. "Sharpened Generalization Bounds Based on Conditional Mutual Information and an Application to Noisy, Iterative Algorithms." In NeurIPS.
Hanin, Boris, and Mihai Nica. 2019. "Finite Depth and Width Corrections to the Neural Tangent Kernel." In Proc. International Conference on Learning Representations (ICLR).
Hastie, Trevor, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. 2022. "Surprises in High-Dimensional Ridgeless Least Squares Interpolation." Annals of Statistics 50 (2): 949-86.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second. Springer.
Hestness, Joel, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. 2017. "Deep Learning Scaling Is Predictable, Empirically." arXiv Preprint arXiv:1712.00409.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. "Training Compute-Optimal Large Language Models." In Advances in Neural Information Processing Systems (NeurIPS).
Hron, Jiri, Yasaman Bahri, Jascha Sohl-Dickstein, and Roman Novak. 2020. "Infinite Attention: NNGP and NTK for Deep Attention Networks." In ICML.
Huang, Jiaoyang, and Horng-Tzer Yau. 2020. "Dynamics of Deep Neural Networks and Neural Tangent Hierarchy." In ICML.
Jacot, Arthur, Franck Gabriel, and Clément Hongler. 2018. "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." In Advances in Neural Information Processing Systems (NeurIPS).
Ji, Ziwei, and Matus Telgarsky. 2019. "The Implicit Bias of Gradient Descent on Nonseparable Data." In Proc. Conference on Learning Theory (COLT).
Jiang, Yiding, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. 2020. "Fantastic Generalization Measures and Where to Find Them." In Proc. International Conference on Learning Representations (ICLR).
Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. "Scaling Laws for Neural Language Models." arXiv Preprint arXiv:2001.08361.
Lacasse, Alexandre, Francois Laviolette, Mario Marchand, Pascal Germain, and Nicolas Usunier. 2007. "PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier." In NeurIPS.
Lee, Jaehoon, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. "Deep Neural Networks as Gaussian Processes." In Proc. International Conference on Learning Representations (ICLR).
Lee, Jaehoon, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. 2020. "Finite Versus Infinite Neural Networks: An Empirical Study." In NeurIPS.
Lee, Jaehoon, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. 2019. "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent." In Advances in Neural Information Processing Systems (NeurIPS).
Lei, Yunwen, and Yiming Ying. 2020. "Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent." In ICML.
Li, Zhiyuan, Ruosong Wang, Dingli Yu, Simon S. Du, Wei Hu, Ruslan Salakhutdinov, and Sanjeev Arora. 2019. "Enhanced Convolutional Neural Tangent Kernels." arXiv Preprint arXiv:1911.00809.
Liang, Tengyuan, and Alexander Rakhlin. 2020. "Just Interpolate: Kernel "Ridgeless" Regression Can Generalize." Annals of Statistics 48 (3): 1329-47.
Maloney, Alexander, Daniel A. Roberts, and James Sully. 2022. "A Solvable Model of Neural Scaling Laws." arXiv Preprint arXiv:2210.16859.
Matthews, Alexander G. de G., Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. 2018. "Gaussian Process Behaviour in Wide Deep Neural Networks." In Proc. International Conference on Learning Representations (ICLR).
McAllester, David A. 1999. "PAC-Bayesian Model Averaging." In Proc. 12th Annual Conference on Computational Learning Theory (COLT), 164-70.
Mei, Song, and Andrea Montanari. 2022. "The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve." Communications on Pure and Applied Mathematics 75 (4): 667-766.
Mei, Song, Andrea Montanari, and Phan-Minh Nguyen. 2018. "A Mean Field View of the Landscape of Two-Layer Neural Networks." In Proc. National Academy of Sciences, 115:E7665-71. 33.
Michaud, Eric J., Ziming Liu, Uzay Girit, and Max Tegmark. 2023. "The Quantization Model of Neural Scaling." In Advances in Neural Information Processing Systems (NeurIPS).
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Foundations of Machine Learning. Second. MIT Press.
Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press. https://probml.github.io/pml-book/book1.html.
- - - . 2023. Probabilistic Machine Learning: Advanced Topics. MIT Press. https://probml.github.io/pml-book/book2.html.
Nagarajan, Vaishnavh, and J. Zico Kolter. 2019. "Uniform Convergence May Be Unable to Explain Generalization in Deep Learning." In Advances in Neural Information Processing Systems (NeurIPS).
Nakkiran, Preetum, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2020. "Deep Double Descent: Where Bigger Models and More Data Hurt." In Proc. International Conference on Learning Representations (ICLR).
Neal, Radford M. 1996. "Priors for Infinite Networks." University of Toronto.
Negrea, Jeffrey, Gintare Karolina Dziugaite, and Daniel M. Roy. 2020. "In Defense of Uniform Convergence: Generalization via Derandomization with an Application to Interpolating Predictors." In ICML.
Neyshabur, Behnam, Srinadh Bhojanapalli, and Nathan Srebro. 2018. "A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks." In Proc. International Conference on Learning Representations (ICLR).
Neyshabur, Behnam, Ryota Tomioka, and Nathan Srebro. 2015. "Norm-Based Capacity Control in Neural Networks." In Proc. Conference on Learning Theory (COLT).
Pensia, Ankit, Varun Jog, and Po-Ling Loh. 2018. "Generalization Error Bounds for Noisy, Iterative Algorithms." In ISIT.
Perez-Ortiz, Maria, Omar Rivasplata, John Shawe-Taylor, and Csaba Szepesvari. 2021. "Tighter Risk Certificates for Neural Networks." Journal of Machine Learning Research 22: 1-40.
Platt, John C. 1999. "Fast Training of Support Vector Machines Using Sequential Minimal Optimization." Microsoft Research.
Power, Alethea, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. 2022. "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." arXiv Preprint arXiv:2201.02177.
Rahimi, Ali, and Benjamin Recht. 2008. "Random Features for Large-Scale Kernel Machines." In Advances in Neural Information Processing Systems (NeurIPS).
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press.
Roberts, Daniel A., Sho Yaida, and Boris Hanin. 2022. The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. Cambridge University Press.
Russo, Daniel, and James Zou. 2016. "Controlling Bias in Adaptive Data Analysis Using Information Theory." In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS).
Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. 2023. "Are Emergent Abilities of Large Language Models a Mirage?" In Advances in Neural Information Processing Systems (NeurIPS).
Schapire, Robert E., Yoav Freund, Peter Bartlett, and Wee Sun Lee. 1998. "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods." Annals of Statistics 26 (5): 1651-86.
Schölkopf, Bernhard, and Alexander J. Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Seeger, Matthias. 2002. "PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification." Journal of Machine Learning Research 3: 233-69.
Sharma, Utkarsh, and Jared Kaplan. 2022. "Scaling Laws from the Data Manifold Dimension." Journal of Machine Learning Research 23: 1-34.
Sorscher, Ben, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari S. Morcos. 2022. "Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning." In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2206.14486.
Soudry, Daniel, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. 2018. "The Implicit Bias of Gradient Descent on Separable Data." Journal of Machine Learning Research 19: 1-57.
Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. "Intriguing Properties of Neural Networks." In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1312.6199.
Tsigler, Alexander, and Peter L. Bartlett. 2023. "Benign Overfitting in Ridge Regression." Journal of Machine Learning Research 24: 1-76.
Tsipras, Dimitris, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. 2019. "Robustness May Be at Odds with Accuracy." In ICLR.
Vaart, Aad W. van der. 1998. Asymptotic Statistics. Cambridge University Press.
Vapnik, Vladimir N. 1995. The Nature of Statistical Learning Theory. Springer.
Vapnik, Vladimir N., and Alexey Ya. Chervonenkis. 1971. "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities." Theory of Probability and Its Applications 16 (2): 264-80.
Woodworth, Blake E., Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. 2020. "Kernel and Rich Regimes in Overparameterized Models." In Proc. Conference on Learning Theory (COLT).
Xu, Aolin, and Maxim Raginsky. 2017. "Information-Theoretic Analysis of Generalization Capability of Learning Algorithms." In Advances in Neural Information Processing Systems (NeurIPS).
Yang, Greg, and Edward J. Hu. 2021. "Feature Learning in Infinite-Width Neural Networks." In Proc. International Conference on Machine Learning (ICML).
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. "Understanding Deep Learning Requires Rethinking Generalization." In Proc. International Conference on Learning Representations (ICLR).

  1. VC bound vacuity for deep nets: capacity proxies (VC dimension, parameter count) overestimate the necessary sample size by orders of magnitude. A network with W107W \approx 10^7 parameters has Baum-Haussler VC of O(WlogW)108O(W \log W) \approx 10^8, implying n109n \gtrsim 10^9 for non-vacuous bounds; ImageNet trains with n106n \approx 10^6 and generalizes anyway.↩︎

  2. Rademacher tightness gap: data-dependent Rademacher complexity (Bartlett and Mendelson 2002) is provably tighter than VC and removes the worst-case dependence on the input distribution, but on over-parameterized networks the empirical Rademacher remains close to its trivial upper bound 2log2/n\sqrt{2 \log 2 / n} and the resulting generalization bound is still vacuous on standard image benchmarks.↩︎

  3. Margin-vs-training-error trade-off: AdaBoost's training error can reach 0 after 5-30 rounds while test error keeps falling for 1000 rounds; the round-count bound predicts overfitting that empirically never occurs. (Schapire et al. 1998) resolved this by bounding test error in terms of the margin distribution, not the round count.↩︎

  4. Kernel design as inductive bias: the choice of kernel (linear, polynomial, RBF, Matern) encodes the hypothesis class via its RKHS, yet there is no principled data-driven selection rule; cross-validation over a small grid is the practical default and conflates kernel choice with bandwidth tuning.↩︎

  5. ERM iid assumption: classical bounds assume the sample is drawn i.i.d. from a fixed distribution; the assumption fails for time-series, online learning, distribution shift, and covariate-shift settings, all of which require either domain-adaptation extensions or different proof techniques.↩︎

  6. Concentration looseness for heavy tails: Hoeffding gives a sub-Gaussian tail under boundedness alone; Bernstein and Bennett tighten this when the variance is small relative to the range. Most empirical loss surfaces in deep learning have heavy-tailed gradients and per-sample-loss values, so the variance term is large and the variance-aware bound is no improvement over Hoeffding.↩︎

  7. Capacity-vs-algorithm coupling: classical theory bounds the worst-case ERM minimizer over a hypothesis class, treating the optimizer as black-box. The actual hypothesis returned by SGD on a non-convex loss is a function of the algorithm's geometry (initialization, learning-rate schedule, batch size, implicit-regularization direction), and the gap between worst-case ERM and SGD-output can be many orders of magnitude in the bound.↩︎

  8. SMO scaling: kernel SVMs solve a quadratic program in the dual with O(N2)O(N^2) Gram-matrix storage and O(N2)O(N^2) to O(N3)O(N^3) compute; SMO (Platt 1999) avoids a general QP solver via pairwise coordinate ascent but the asymptotic scaling persists. Past N105N \approx 10^5 the Gram matrix exceeds typical memory and approximation methods (Nystrom, random features) become mandatory.↩︎

  9. Kernel design as inductive bias: the choice of kernel (linear, polynomial, RBF, Matern) encodes the hypothesis class via its RKHS, yet there is no principled data-driven selection rule; cross-validation over a small grid is the practical default and conflates kernel choice with bandwidth tuning.↩︎

  10. SMO scaling: kernel SVMs solve a quadratic program in the dual with O(N2)O(N^2) Gram-matrix storage and O(N2)O(N^2) to O(N3)O(N^3) compute; SMO (Platt 1999) avoids a general QP solver via pairwise coordinate ascent but the asymptotic scaling persists. Past N105N \approx 10^5 the Gram matrix exceeds typical memory and approximation methods (Nystrom, random features) become mandatory.↩︎

  11. Margin-vs-training-error trade-off: AdaBoost's training error can reach 0 after 5-30 rounds while test error keeps falling for 1000 rounds; the round-count bound predicts overfitting that empirically never occurs. (Schapire et al. 1998) resolved this by bounding test error in terms of the margin distribution, not the round count.↩︎

  12. Margin-vs-training-error trade-off: AdaBoost's training error can reach 0 after 5-30 rounds while test error keeps falling for 1000 rounds; the round-count bound predicts overfitting that empirically never occurs. (Schapire et al. 1998) resolved this by bounding test error in terms of the margin distribution, not the round count.↩︎

  13. ERM iid assumption: classical bounds assume the sample is drawn i.i.d. from a fixed distribution; the assumption fails for time-series, online learning, distribution shift, and covariate-shift settings, all of which require either domain-adaptation extensions or different proof techniques.↩︎

  14. Capacity-vs-algorithm coupling: classical theory bounds the worst-case ERM minimizer over a hypothesis class, treating the optimizer as black-box. The actual hypothesis returned by SGD on a non-convex loss is a function of the algorithm's geometry (initialization, learning-rate schedule, batch size, implicit-regularization direction), and the gap between worst-case ERM and SGD-output can be many orders of magnitude in the bound.↩︎

  15. Random-label baseline: (Zhang et al. 2017)'s randomization test is not a generalization bound but a sharp falsification of capacity-based explanations; if a network can memorize random labels at training error 0, no class-based capacity argument can produce a non-trivial bound on the same architecture for the real-label problem.↩︎

  16. SGD-optimization of the bound: Dziugaite and Roy treat the PAC-Bayes bound as a training loss and optimize a Gaussian posterior with SGD; the recipe gives the only fielded non-vacuous bound (0.161 on T-600 binary MNIST) but does not scale beyond 3-layer MLPs because the KL term grows superlinearly with depth.↩︎

  17. Vacuous bounds on deep nets: KL/n\sqrt{\mathrm{KL}/n} blows up because the KL between a random initialization (the legal prior) and the trained weights (the posterior mean) is huge in deep nets, so the resulting bound exceeds 1 on a [0,1] loss except for very small networks.↩︎

  18. Large KL = small bound trade-off: tightening the bound by shifting the posterior mean toward the prior reduces KL but also raises empirical risk; the global optimum of the bound trades these two terms and on deep nets is dominated by KL except for very small models.↩︎

  19. Prior must be data-independent: any dependence of the prior PP on the training sample SS breaks the McAllester proof; legal data-dependent priors (Dziugaite-Roy 2018, Perez-Ortiz 2021) are constructed on a disjoint split SS', trading a smaller KL for sample-splitting overhead and bookkeeping.↩︎

  20. Spectral-norm product vacuity: W2\prod_\ell \|W_\ell\|2 grows polynomially in depth (each layer contributes a factor σmax\sigma{\max} that is rarely <1<1 for ReLU nets at convergence), so the Bartlett-Foster-Telgarsky bound exceeds 1 beyond AlexNet-scale architectures even when the empirical gap is well-controlled.↩︎

  21. MI bounds tighter but harder to estimate: information-theoretic bounds replace capacity with I(S;W)I(S; W), which is finite for noisy SGD and can in principle be smaller than KL-based PAC-Bayes; estimating I(S;W)I(S; W) in high dimensions requires either a Gaussian-process surrogate, a Langevin assumption, or kernelized lower bounds.↩︎

  22. Conditional MI tightening: (Haghifam et al. 2020) replace I(S;W)I(S; W) by a conditional MI given a "supersample" construction; the conditioning makes the term smaller but shifts the estimation problem to one that is in general no easier.↩︎

  23. Kendall-rank vs absolute value: a generalization measure can have high rank correlation with the true gap (its predictive value across a hyperparameter sweep) while its absolute number is vacuous; conversely a tight bound on a single model says nothing about how the bound moves under perturbation. Reporting both is the modern standard.↩︎

  24. Flatness measure fragility: sharpness as a generalization predictor depends on the parameterization; Dinh et al. showed that pointwise reparameterization can scale sharpness arbitrarily without changing the predictor function, so a "flat" minimum in one parameterization can be sharp in another, defeating naive flatness-as-capacity arguments.↩︎

  25. Uniform-convergence impossibility: Nagarajan and Kolter exhibit a linear-classifier setting in which any two-sided uniform-convergence bound is provably nearly vacuous on the SGD-output distribution, ruling out the entire classical-style program for explaining deep-net generalization.↩︎

  26. Vacuous bounds on deep nets: KL/n\sqrt{\mathrm{KL}/n} blows up because the KL between a random initialization (the legal prior) and the trained weights (the posterior mean) is huge in deep nets, so the resulting bound exceeds 1 on a [0,1] loss except for very small networks.↩︎

  27. Large KL = small bound trade-off: tightening the bound by shifting the posterior mean toward the prior reduces KL but also raises empirical risk; the global optimum of the bound trades these two terms and on deep nets is dominated by KL except for very small models.↩︎

  28. SGD-optimization of the bound: Dziugaite and Roy treat the PAC-Bayes bound as a training loss and optimize a Gaussian posterior with SGD; the recipe gives the only fielded non-vacuous bound (0.161 on T-600 binary MNIST) but does not scale beyond 3-layer MLPs because the KL term grows superlinearly with depth.↩︎

  29. Prior must be data-independent: any dependence of the prior PP on the training sample SS breaks the McAllester proof; legal data-dependent priors (Dziugaite-Roy 2018, Perez-Ortiz 2021) are constructed on a disjoint split SS', trading a smaller KL for sample-splitting overhead and bookkeeping.↩︎

  30. Spectral-norm product vacuity: W2\prod_\ell \|W_\ell\|2 grows polynomially in depth (each layer contributes a factor σmax\sigma{\max} that is rarely <1<1 for ReLU nets at convergence), so the Bartlett-Foster-Telgarsky bound exceeds 1 beyond AlexNet-scale architectures even when the empirical gap is well-controlled.↩︎

  31. MI bounds tighter but harder to estimate: information-theoretic bounds replace capacity with I(S;W)I(S; W), which is finite for noisy SGD and can in principle be smaller than KL-based PAC-Bayes; estimating I(S;W)I(S; W) in high dimensions requires either a Gaussian-process surrogate, a Langevin assumption, or kernelized lower bounds.↩︎

  32. Conditional MI tightening: (Haghifam et al. 2020) replace I(S;W)I(S; W) by a conditional MI given a "supersample" construction; the conditioning makes the term smaller but shifts the estimation problem to one that is in general no easier.↩︎

  33. Kendall-rank vs absolute value: a generalization measure can have high rank correlation with the true gap (its predictive value across a hyperparameter sweep) while its absolute number is vacuous; conversely a tight bound on a single model says nothing about how the bound moves under perturbation. Reporting both is the modern standard.↩︎

  34. Flatness measure fragility: sharpness as a generalization predictor depends on the parameterization; Dinh et al. showed that pointwise reparameterization can scale sharpness arbitrarily without changing the predictor function, so a "flat" minimum in one parameterization can be sharp in another, defeating naive flatness-as-capacity arguments.↩︎

  35. MI bounds tighter but harder to estimate: information-theoretic bounds replace capacity with I(S;W)I(S; W), which is finite for noisy SGD and can in principle be smaller than KL-based PAC-Bayes; estimating I(S;W)I(S; W) in high dimensions requires either a Gaussian-process surrogate, a Langevin assumption, or kernelized lower bounds.↩︎

  36. Flatness measure fragility: sharpness as a generalization predictor depends on the parameterization; Dinh et al. showed that pointwise reparameterization can scale sharpness arbitrarily without changing the predictor function, so a "flat" minimum in one parameterization can be sharp in another, defeating naive flatness-as-capacity arguments.↩︎

  37. Uniform-convergence impossibility: Nagarajan and Kolter exhibit a linear-classifier setting in which any two-sided uniform-convergence bound is provably nearly vacuous on the SGD-output distribution, ruling out the entire classical-style program for explaining deep-net generalization.↩︎

  38. Random-label baseline: (Zhang et al. 2017)'s randomization test is not a generalization bound but a sharp falsification of capacity-based explanations; if a network can memorize random labels at training error 0, no class-based capacity argument can produce a non-trivial bound on the same architecture for the real-label problem.↩︎