← Back to home

Optimization for Machine Learning

Optimal transport, Bayesian optimization, and constrained learning

Bayesian Optimization: Theory and Applications

The foundations of Bayesian optimization trace back to work on efficient global optimization (EGO) (Jones, Schonlau, and Welch 1998), which introduced expected improvement as an acquisition function for response surface optimization. The theoretical framework using Gaussian processes draws heavily from the comprehensive treatment in (Rasmussen and Williams 2006).

Modern tutorials (Brochu, Cora, and Freitas 2010; Frazier 2018) provide accessible introductions to the field, while (Shahriari et al. 2016) offers an extensive review covering methodological advances and applications. The GP-UCB algorithm (Srinivas et al. 2012) established regret bounds connecting Bayesian optimization to the bandit literature.

Practical Bayesian optimization for machine learning hyperparameter tuning was demonstrated by (Snoek, Larochelle, and Adams 2012), showing that careful GP prior design leads to state-of-the-art performance. High-dimensional extensions include random embeddings (Z. Wang et al. 2016), local trust regions (Eriksson et al. 2019), and sparse axis-aligned subspaces (Eriksson and Jankowiak 2021).

Information-theoretic acquisition functions (Hernández-Lobato, Hoffman, and Ghahramani 2014) provide principled approaches to exploration, while portfolio methods (Brochu, Hoffman, and Freitas 2011) combine multiple acquisition strategies. Multi-fidelity methods (Kandasamy et al. 2017; Li et al. 2020; Swersky, Snoek, and Adams 2014) leverage cheap approximations to accelerate optimization. Transfer learning (Feurer et al. 2022; Swersky, Snoek, and Adams 2013) enables knowledge transfer across related optimization tasks.

Scalable Gaussian process inference (Snelson and Ghahramani 2006; Titsias 2009; Matthews et al. 2016) underpins practical BO implementations, with GPyTorch (Gardner et al. 2018) providing efficient GPU-accelerated computation. Applications span neural architecture search (Kandasamy et al. 2018; Falkner, Klein, and Hutter 2018), biological sequence design (Stanton et al. 2022), and safe exploration (Sui et al. 2015).

The following leaderboard summarises the methods this chapter dissects, organised by the regime each one targets. Read the table top-down as exploitation/exploration first, then scalability, then dimensionality, then setting-specific extensions.

Leaderboard of canonical BO methods organised by regime. Surrogate records the inference primitive; Acq. records the acquisition rule; Regime records when the method is the right default; Headline records a representative empirical result. The chapter weaves each row into prose with full citations.
Method Surrogate Acq. Regime Headline result
EGO (Jones, Schonlau, and Welch 1998) Kriging EI low-D smooth deterministic first BO convergence proof, 14 yr predates regret bounds
GP-UCB (Srinivas et al. 2012) exact GP UCB bandit-style with regret bound RT=O*(TβTγT)R_T = O^*(\sqrt{T \beta_T \gamma_T}) sublinear
PES (Hernández-Lobato, Hoffman, and Ghahramani 2014) exact GP + EP PES low-D, very expensive evals 20-30 percent over EI on Branin and Hartmann-6
GP-Hedge (Brochu, Hoffman, and Freitas 2011) exact GP portfolio unknown landscape, kernel uncertainty beats best-individual on 28/32 synthetic benchmarks
Spearmint (Snoek, Larochelle, and Adams 2012) MCMC GP EI per sec ML hyperparameter tuning 14.98 percent CIFAR-10 vs 18 percent hand-tuned
q-EI-SGA (J. Wang et al. 2020) exact GP q-EI parallel batch, q up to 50 7x speedup at q=8q = 8 on ATO simulator
SPGP / FITC (Snelson and Ghahramani 2006) sparse GP any mid-data N104N \sim 10^4 M=50M = 50 inducing points reach 95 percent of full GP
VFE (Titsias 2009) variational GP any mid-data, hyperparameter-safe bound monotone in inducing set
BBMM (GPyTorch) (Gardner et al. 2018) exact GP via CG any large-data N106N \sim 10^6 exact GP on 1M points in 6 min on V100
REMBO (Z. Wang et al. 2016) low-D GP EI low effective dim, deffDd_{\text{eff}}\!\ll\! D 3-5x sample efficiency on 69D MIP-tuning
TuRBO (Eriksson et al. 2019) local GPs TS high-D rough, up to D=400D = 400 beats CMA-ES on 200D Rover Trajectory
SAASBO (Eriksson and Jankowiak 2021) NUTS GP horseshoe TS high-D sparse, expensive evals 388D Mopta in 200 evals vs TuRBO's 500
Add-GP (Duvenaud, Nickisch, and Rasmussen 2011) additive GP EI/UCB low-order interactions dominates SE on 9 of 11 UCI regression sets
BOCA (Kandasamy et al. 2017) MF-GP cost-UCB continuous fidelity 7x speedup on neural-net tuning, 10x on supernova LL
DNN-MFBO (Li et al. 2020) cascaded BNN MES discrete fidelity, deep models 60 vs 200 samples on 8D 3-fidelity FEM
Freeze-thaw (Swersky, Snoek, and Adams 2014) curve-kernel GP per-action ES training-curve fidelity 3x wall-clock vs Spearmint on PMF/MovieLens-10M
Multi-task GP (Swersky, Snoek, and Adams 2013) ICM joint GP cost-ES warm-start from related tasks 10 vs 30 target evals on small-MNIST to MNIST
RGPE (Feurer et al. 2022) source-GP ensemble EI hyperparameter-free transfer 3-5x wall-clock saved across 2086 OpenML tasks
PESC (Hernández-Lobato et al. 2016) constraint GPs PESC active black-box constraints 5x regret gap vs EIC on Hartmann6-C1
SafeOpt (Sui et al. 2015) exact GP + Lipschitz max-var safety-critical online 99.4 percent safe rate on MovieLens-3.5
NASBOT (Kandasamy et al. 2018) OTMANN-kernel GP TS neural architecture search 11.07 percent CIFAR-10 in 200 evals
BOHB (Falkner, Klein, and Hutter 2018) TPE per budget density-ratio Hyperband-augmented HPO 2x BO and 1.5x Hyperband on CartPole PPO
LaMBO (Stanton et al. 2022) latent-space MTGP NEHVI discrete biological sequences 40 percent over GA on GFP at 500 rounds
Policy-search BO (Letham and Bakshy 2019) rank-1 ICM TS online + offline policy search 5+50 vs 25+ online evals at Facebook
FBO (Ha et al. 2019) exact GP EI + box test unknown search bounds 40 vs 200+ evals on log-scale SVM hyperparameters

Constrained Optimization in Machine Learning

Many ML pipelines have a two-stage structure: a predictor outputs unknown parameters; a solver consumes them and returns decisions. Training the predictor with MSE or cross-entropy is agnostic to how prediction errors propagate through the solver, and small parameter errors near a constraint boundary can flip an entire decision12. Constrained-optimization learning (variously decision-focused learning, predict-then-optimize, or differentiable-optimization) closes this loop by differentiating through the solver itself; the predictor learns to minimize regret on decisions, not error on parameters. Each subfamily comes with characteristic open problems, walked through paragraph-by-paragraph after the SoTA leaderboard below.

The historical trajectory has three eras. Early implicit-layer work (2010-2017) established the gradient calculus for graphical-model inference (Domke 2010) and convex-program sensitivity (Barratt 2019); the differentiable-layer era (2017-2021) packaged this calculus into deep-learning frameworks via OptNet (Amos and Kolter 2017) and CvxpyLayers (Agrawal et al. 2019), with the predict-then-optimize formulation (Elmachtoub and Grigas 2022) giving consistent surrogate losses; and the modern era (2021-present) has scaled to combinatorial problems via blackbox finite differences (Vlastelica et al. 2020) and perturbed maximizers (Berthet et al. 2020), to structured prediction via Fenchel-Young losses (Blondel, Martins, and Niculae 2020), and to dual learning (Klamkin, Tanneau, and Hentenryck 2025) for large-scale constrained programs.

The taxonomy below distinguishes seven families. Implicit-layer methods solve the forward problem with a classical solver and back-propagate by differentiating the KKT system; this is exact when the active set is fixed, but the linear system is ill-conditioned at the boundary3 and costly to assemble at every step4. Predict-then-optimize methods replace the MSE training signal with a regret-based surrogate (SPO+, CaVE), eliminating the loss-mismatch5 but inheriting the solver's non-differentiability for combinatorial problems6. Combinatorial methods sidestep non-differentiability via blackbox interpolation, perturb-and-MAP, or Sinkhorn relaxations, paying for the smoothing with an integer-relaxation gap7. Augmented-Lagrangian and prox-based methods enforce constraints by penalty rather than solver call, trading exact feasibility for a tunable violation-vs-optimality balance8. Bilevel methods fold the inner problem into a hyper-gradient, enabling millions of hyperparameters but requiring careful Hessian-vector treatment. Structured-prediction methods cast inference as regularized optimization with closed-form Fenchel-Young losses. Application-driven methods (MPC, portfolio, routing) tune the recipe to a problem class. A persistent diagnostic across the field is scale mismatch: an MLP predictor's natural output scale is rarely the solver's parameter scale, and naive loss combinations across the two scales train poorly9; another is the failure to exploit problem structure10, where a generic implicit-layer beats a problem-specific decomposition only on toy sizes.

Constrained-optimization-learning SoTA leaderboard. Method links to the chapter section and cites the canonical paper. Architecture/Layer = the differentiable mechanism (implicit KKT, prox, perturbed argmax, augmented Lagrangian, etc.). Loss = the training signal driving the predictor. Problem class abbreviates LP / QP / MILP / SOCP / Combinatorial (Comb) / Convex (Cvx). Constraint-sat = constraint-satisfaction rate (% of test instances feasible). Opt-gap = optimality gap (%) vs the exact-solver optimum. Speedup = wall-clock vs the classical solver (1x = parity). OOD = qualitative generalization to out-of-distribution constraints (high / med / low). Dashes mark unreported settings. Numbers are paper-reported headline values; settings differ across rows and only same-row reads are consistent.
Constraints Optimality Speed OOD
Method Year Architecture Loss Problem sat (%) gap (%) speedup gen
Domke implicit grad (Domke 2010) 2010 implicit graphical-model marginal NLL Cvx-MAP 100 - 1x low
Barratt sensitivity (Barratt 2019) 2019 implicit cone-program downstream-task SOCP 100 - 0.3x med
ICNN (Amos, Xu, and Kolter 2017) 2017 input-convex MLP MSE on objective Cvx 100 1.0 5x med
OptNet (Amos and Kolter 2017) 2017 implicit QP layer (KKT) task / MSE QP 100 0.3 0.5x med
DiffMPC (Amos et al. 2018) 2018 implicit iLQR imitation QP-MPC 100 1.5 1x med
SPO+ (Elmachtoub and Grigas 2022) 2017 predict-then-LP regret-based (SPO+) LP 100 4.2 1x high
CvxpyLayers (Agrawal et al. 2019) 2019 implicit DCP cone task Cvx 100 - 0.4x high
GNN-NP (Prates et al. 2019) 2018 message-passing GNN binary CE on feasibility SAT 85 - 30x low
AttnTSP (Kool, Hoof, and Welling 2019) 2019 encoder-decoder attention REINFORCE rollout-baseline Comb 100 2.6 100x low
Bilevel-HPO (Franceschi et al. 2018) 2018 reverse-mode unrolling val loss Cvx - - - -
Lorraine M-HP (Lorraine, Vicol, and Duvenaud 2020) 2019 Neumann implicit val loss Cvx - - 10x high
Meta-DCO (Lee et al. 2019) 2019 implicit ridge meta-loss (CE) QP 100 - 0.6x high
Blackbox-CO (Vlastelica et al. 2020) 2020 finite-diff via solver hinge / Hamming Comb 100 1.4 1x med
Perturbed-FY (Berthet et al. 2020) 2020 Gumbel-perturbed argmax Fenchel-Young Comb 100 2.0 1x high
Fenchel-Young (Blondel, Martins, and Niculae 2020) 2020 regularized prediction Fenchel-Young Comb 100 - 1x high
DiffDP (Mensch and Blondel 2018) 2018 smoothed Bellman structured DP-Comb 100 - 5x med
SATNet (P.-W. Wang et al. 2019) 2019 low-rank SDP (Mixing) MSE on logits SDP-MaxSAT 100 - 20x low
DDN (Gould, Hartley, and Campbell 2021) 2021 declarative implicit task Cvx 100 - 0.5x med
Kotary L2O (Kotary, Fioretto, and Hentenryck 2021) 2021 NN-imitation of solver feas+opt joint LP/QP 92 1.8 50x low
BPQP (Pan et al. 2024) 2024 structured QP backward task QP 100 0.4 5x med
CaVE (Tang and Khalil 2024) 2024 cone-aligned MILP regret cone-loss MILP 100 0.9 1x high
Dual-OL (Klamkin, Tanneau, and Hentenryck 2025) 2025 augmented-Lagrangian dual-objective LP/QP 99 0.7 8x high

The table groups the literature into seven families, walked through in chronological order. Implicit-layer foundations (Domke, Barratt) gave the gradient calculus for parametric MAP and convex programs by differentiating the KKT system; the construction is exact when the active set is fixed but is dominated by the cost of solving an inner linear system at every gradient step11, and is numerically delicate near a constraint boundary12. The 100% constraint-satisfaction reported in the table for these rows reflects that the forward pass is a feasible solver call; the OOD column flags whether the predictor's parameters generalize beyond the training distribution of constraints, which Domke's marginal-NLL recipe does not address while CvxpyLayers and Lorraine's hyperparameter approach do.

The differentiable-optimization-layer line (OptNet, CvxpyLayers, DDN, BPQP) extended the implicit-layer recipe to general QP and DCP problems, packaging the KKT-Jacobian backward into PyTorch and JAX. The wall-clock cost of these layers (0.3-0.5x compared to a forward-only solver call) is dominated by an LU factorization on the KKT matrix at every step13, with BPQP reducing the constant factor by 5-10x via structure-aware factorization but not the asymptotic order; the implicit-layer family also fails to exploit problem-specific decompositions that classical solvers use for free14. Within this line, input-convex networks (Amos, Xu, and Kolter 2017) take an orthogonal route: instead of differentiating a solver, they constrain the network architecture to be convex in its input, so the forward pass is an unconstrained convex optimization with closed-form gradients.

Predict-then-optimize methods (SPO+, CaVE) replace the predictor's MSE loss with a regret-based surrogate that knows the downstream solver. SPO+ is a Fisher-consistent convex upper bound on the true regret, achieved with a single extra solver call per training point and gradients computed without backprop through the solver itself; this sidesteps the non-differentiable-solver obstacle15 but pays for it with a 4.2% optimality gap at convergence (vs OptNet's 0.3% on small QPs) because the surrogate is not the true regret16. CaVE extends the recipe to MILP by aligning the predicted cost vector to the optimality cone of the integer-feasible solution, the first decision-focused method that closes the gap on binary programs without smoothing the integer feasibility17.

Combinatorial methods (Blackbox-CO, Perturbed-FY, Fenchel-Young, GNN-NP, AttnTSP, SATNet) attack the harder problem of differentiating through non-differentiable integer/SAT solvers18. Blackbox differentiation runs the solver twice with perturbed parameters and uses a finite-difference gradient on the resulting decisions; perturbed-FY adds Gumbel noise to the solver inputs and treats the expected argmax as a differentiable smoothed version, with closed-form gradients via the Fenchel-Young machinery. SATNet takes a third route, replacing the solver entirely with a low-rank SDP relaxation differentiable end-to-end. The cost of all three is an integer-relaxation gap19: the relaxed gradient signal is well-defined but biased away from the true integer optimum, and the gap does not vanish as the predictor improves.

Bilevel methods (Franceschi, Lorraine, Lee) fold an inner optimization (model fit on the train set) into an outer hyper-gradient (val loss). Reverse-mode unrolling computes exact hyper-gradients but stores all inner-loop activations, limiting the inner horizon to ~20-50 steps (Franceschi et al. 2018); Neumann-implicit methods (Lorraine, Vicol, and Duvenaud 2020) approximate the hyper-Jacobian as a truncated Neumann series, scaling to millions of hyperparameters at the cost of approximation bias. The OOD column is high for these because the meta-loop is training for OOD generalization by construction; the inner-loop ill-conditioning is the same KKT pathology20 as in the implicit-layer family.

Dual learning (Klamkin 2025) and augmented-Lagrangian methods enforce constraints via a learned dual variable rather than a solver call, training a primal predictor and a dual predictor jointly with an augmented-Lagrangian objective. The reported 99% constraint-satisfaction (vs 100% for solver-in-the-loop methods) is the price of dropping the inner solver from the training loop, recovered as an 8x training speedup; the prox-update on the dual variable is what stabilizes the alternation between primal-feasibility and primal-optimality21. Against the predict-then-optimize family, dual learning trades a small infeasibility rate for a much faster training loop and better scaling to large LP/QP instances.

The table separates four eras: foundational implicit gradients (2010-2017), the differentiable-layer build-out (2017-2021), combinatorial and predict-then-optimize maturation (2018-2024), and the modern dual learning and cone-aligned recipes (2024-2025). Three readings. First, no single method dominates: implicit-layer methods are exact but slow (0.3-0.5x), predict-then-optimize is fast but has a 1-4% optimality floor, combinatorial is fast on integer problems but hurt by relaxation bias, and dual learning is fastest but trades a percent of feasibility. Second, OOD generalization is the unresolved axis: the methods rated "high" (SPO+, CvxpyLayers, Lorraine, Perturbed-FY, FY, Meta-DCO, CaVE, Dual-OL) are those whose training signal is the downstream task, while methods rated "low" (Domke, GNN-NP, AttnTSP, SATNet, Kotary) overfit to the training instance distribution. Third, the trend is toward hybrid recipes: BPQP combines structured backward with task loss, CaVE combines cone alignment with decision-focused regret, and Dual-OL combines augmented-Lagrangian with primal-dual prediction; the next generation appears to be augmented-Lagrangian + decision-focused trained end-to-end on industrial-scale LPs.

Optimal Transport: Theory and Applications in Machine Learning

The Feasibility-vs-Fidelity Tension

One tension drives the entire computational history of optimal transport: exact solutions are provably correct but computationally prohibitive, while tractable approximations each pay a distinct fidelity price. Every algorithmic generation from 1955 to 2024 is best understood as a specific answer to the question "how much fidelity can we afford to trade for the compute budget we actually have?" The five major lineages (classical LP, entropic/Sinkhorn, sliced projection, unbalanced/Gromov-Wasserstein extensions, and neural/flow-matching) converge, in recent work, on one unifying principle: recover a coupling or transport map that minimizes a cost, subject to the marginal and feasibility constraints your budget can enforce.

The table below places every major method on this trade-off surface. Methods appear as responses to the limitations of their predecessors; each section below follows that logic rather than a strict chronology. Textbook coverage is in the Mathematical Foundations section; algorithmic details are in Computational Methods and Extensions.

Villani's two volumes (Villani 2003, 2009) remain the definitive references; Santambrogio (Santambrogio 2014, 2015) is the most compact entry point with a calculus-of-variations emphasis; Ambrosio-Gigli-Savaré (Ambrosio, Gigli, and Savaré 2008) develops the gradient-flow theory; and Peyré-Cuturi (Peyré and Cuturi 2020) is the standard computational reference. Rachev's 1985 monograph (Rachev 1985) introduced the unified "Monge-Kantorovich" nomenclature to the Western probability community. Each reference carries characteristic strengths and well-known limitations222324252627282930, walked through method-by-method after the table.

OT SoTA leaderboard. Method links to the chapter section and cites the canonical paper. Solver = the actual computational kernel (LP, Sinkhorn, projection-1D, ICNN, ODE). Loss = the training signal (transport cost, regularized dual, KL-marginal, displacement-velocity MSE). Problem = the OT family (continuous OT / discrete OT / unbalanced / GW / Schrödinger). Accuracy = relative W22W_2^2 error on the synthetic Gaussian pair 𝒩(0,Id)𝒩(m,Σ)\mathcal{N}(0,I_d) \to \mathcal{N}(m, \Sigma) where the closed-form Bures-Wasserstein gives ground truth (lower is better, smaller ε\varepsilon reduces bias). Compute = wall-clock per 1000-sample pair on a single CPU core (M means matrix-only; ODE means 50 NFE). Scale columns: 1D->100D = whether the same algorithm transfers from low to high dimension without algorithmic surgery; Memory = working-set size for n=10k samples. Dashes mark unreported / not-applicable settings.
Accuracy Compute Scale
Method Year Solver Loss Problem W2 err t/1k 1D->100D Mem
Hungarian / network-simplex 1955 exact-LP linear assignment discrete 0.0% 30 s poor O(n2)
Brenier polar factorization (Brenier 1991) 1991 analytic-gradient - continuous 0.0% - analytic -
Gangbo-McCann geometry (Gangbo and McCann 1996) 1996 analytic c-cyclical-monotonicity continuous 0.0% - analytic -
Benamou-Brenier dynamic (Benamou and Brenier 2000) 2000 ALG2 / ADMM action functional continuous 0.5% 5 min poor O(Nd T)
Gromov-Wasserstein (Mémoli 2011) 2011 quadratic-LP bilinear coupling GW n/a 60 s local-min O(n2)
Sinkhorn-Cuturi (Cuturi 2013) 2013 Sinkhorn(eps=0.01) regularized-cost discrete 5-15% 0.3 s good O(n2)
Cuturi-Doucet barycenters (Cuturi and Doucet 2014) 2014 Sinkhorn-fixed-pt weighted-mean discrete 8% 1 s good O(Kn2)
Stochastic OT (Genevay et al. 2016) 2016 SAG / SGD-dual regularized-dual continuous 6% 2 s good O(n)
Wasserstein GAN (Arjovsky, Chintala, and Bottou 2017) 2017 dual-network KR-Lipschitz-1 continuous n/a - good O(net)
WGAN-GP (Gulrajani et al. 2017) 2017 dual-network KR + grad-penalty continuous n/a - good O(net)
OT domain adaptation (Courty et al. 2017) 2017 Sinkhorn(class) barycentric-mapping discrete 7% 0.5 s good O(n2)
Sinkhorn divergences (Genevay, Peyré, and Cuturi 2018) 2018 Sinkhorn(3 runs) debiased-Weps discrete 1-3% 1 s good O(n2)
Unbalanced OT (Chizat et al. 2018) 2018 scaling(KL-marg) KL-relaxed-marginals unbalanced 4% 0.4 s good O(n2)
APDAGD acceleration (Dvurechensky, Gasnikov, and Kroshnin 2018) 2018 Nesterov-dual regularized-dual discrete 3% 0.2 s good O(n2)
Sliced-Wasserstein (Kolouri et al. 2018) 2018 proj-1D(L=128) mean-1D-W2 continuous 12% 0.04 s excellent O(Ln)
Tree-Sliced-Wasserstein (Le et al. 2019) 2019 tree-projection tree-1D-W continuous 9% 0.05 s excellent O(Ln)
Large-scale neural mapping (Seguy et al. 2018) 2018 dual-net + map semi-dual + reg continuous 5% 1 epoch good O(net)
ICNN saddle-point (Amos, Xu, and Kolter 2017) 2017 ICNN-pair min-max-Brenier continuous 4% hours good O(net)
JKOnet (Bunne et al. 2022) 2022 ICNN-JKO proximal-step gradient-flow n/a 2 GPU-d good O(net)
Neural-OT (Korotin) (Korotin, Selikhanovych, and Burnaev 2023) 2023 min-max-potential dual-saddle continuous 3% hours good O(net)
Sinkformer attention (Sander et al. 2022) 2022 Sinkhorn-attn doubly-stochastic discrete n/a layer good O(n2)
OT flow-matching (rectified flow) 2023 mini-batch-OT + ODE velocity-MSE continuous 2% 1 ODE-step good O(net)
Schrödinger-bridge (DSB) 2023 iterative-IPF KL-path-measure Schrödinger 2% hours good O(net)
Peyré-Cuturi textbook (Peyré and Cuturi 2020) 2020 reference - - - - - -

The table groups the literature into five generations, each a response to the feasibility-vs-fidelity tension. The sub-sections below follow the same logic. The table below summarizes the structural trade-offs of each lineage; individual method numbers are in the SoTA leaderboard above.

Lineage-level feasibility-vs-fidelity properties. Cost = per-problem time complexity in nn samples; Bias = relative error vs.\ exact OT on synthetic Gaussians (from SoTA table above); Differentiable = whether the loss is differentiable end-to-end without manual adjoint derivation; Continuous dist.\ = whether the method applies to continuous (non-discrete) source/target measures; Unequal mass = whether the method handles μ(𝒳)ν(𝒴)\mu(\mathcal{X}) \neq \nu(\mathcal{Y}).
Lineage Cost Bias Differentiable Continuous dist. Unequal mass
Exact LP / analytic O(n3logn)O(n^3 \log n) 0% no (plan is sparse, vertex changes) analytic cases only no
Entropic / Sinkhorn Õ(n2/ε2)\tilde{O}(n^2/\varepsilon^2) 2-15% yes (implicit diff.) via sampling no
Sliced projection O(Lnlogn)O(Ln \log n) 5-12% yes (sorting) yes no
Unbalanced / GW Õ(n2)\tilde{O}(n^2) 3-8% partial partial yes
Neural / flow-matching O(training)O(\text{training}) 2-5% yes (autodiff) yes partial

Generation 1 (1955-2000): Exact LP and Analytic Foundations

Classical exact-LP methods (Hungarian, network simplex) solve the discrete Kantorovich problem to numerical zero error, but the O(n3logn)O(n^3 \log n) cost makes them prohibitive beyond n104n \approx 10^4 source/target points31; this cubic wall is the principal motivation for every subsequent computational development. The same era produced the analytic theory: Brenier's (Brenier 1991) polar factorization established that, for absolutely continuous source measures and squared-Euclidean cost, the optimal map exists, is unique, and equals the gradient of a convex potential; Gangbo-McCann (Gangbo and McCann 1996) generalized this to any strictly convex superlinear cost and identified cc-cyclical monotonicity as the universal characterization of optimal couplings. These foundational results give exact answers in closed form for tractable cases (Gaussians via Bures-Wasserstein, 1D via quantile composition) but provide no algorithm for the high-dimensional empirical regime that ML faces.

The fidelity of exact-LP is maximal (zero relative error) but the feasibility cost is a cubic barrier. Generation 2 resolves this by accepting a tunable bias in exchange for near-linear parallel complexity.

Generation 2 (2013-2018): Entropic Regularization and the Sinkhorn Revolution

Entropic-OT via the Sinkhorn iteration (Cuturi 2013) is the algorithmic pivot of modern computational OT: the addition of a small KL-regularizer εKL(γμν)\varepsilon \mathrm{KL}(\gamma \| \mu \otimes \nu) makes the problem strictly convex, the optimal coupling factors as P*=diag(𝐮)Kdiag(𝐯)P^* = \mathrm{diag}(\mathbf{u}) K \mathrm{diag}(\mathbf{v}), and matrix-scaling iterations converge linearly. The complexity drops from O(n3logn)O(n^3 \log n) to Õ(n2/ε2)\tilde{O}(n^2/\varepsilon^2) (Dvurechensky, Gasnikov, and Kroshnin 2018), with massive parallelism across rows and columns, and the iterations are differentiable through the converged dual potentials; these three properties (speed, parallelism, differentiability) are jointly responsible for OT's penetration into deep learning. The price is the entropic bias: WεWW_\varepsilon \neq W and the entropic plan is "blurred" by an amount controlled by ε\varepsilon32. Sinkhorn divergences Sε=Wε(μ,ν)12Wε(μ,μ)12Wε(ν,ν)S_\varepsilon = W_\varepsilon(\mu,\nu) - \tfrac12 W_\varepsilon(\mu,\mu) - \tfrac12 W_\varepsilon(\nu,\nu) (Genevay, Peyré, and Cuturi 2018) correct this debiasing at the cost of three solver runs; epsilon-scaling (geometric annealing of ε\varepsilon) recovers near-exact OT in roughly an order of magnitude fewer total iterations.

Entropic OT still scales as O(n2)O(n^2) in the cost matrix, which dominates memory and compute when nn is large and the ground cost is expensive to evaluate. Generation 3 sidesteps the cost matrix entirely by projecting to 1D.

Generation 3 (2018-2019): Sliced OT and Dimension-Free Projections

Sliced and projection-based methods (Kolouri et al. 2018; Le et al. 2019) sidestep the O(n2)O(n^2) cost matrix by averaging univariate Wasserstein distances over directions of the unit sphere Sd1S^{d-1}: SWpp(μ,ν)=Sd1Wpp(θ#μ,θ#ν)dθSW_p^p(\mu,\nu) = \int_{S^{d-1}} W_p^p(\theta_\#\mu, \theta_\#\nu)\,d\theta. Each 1D projection sorts in O(nlogn)O(n \log n), so LL projections cost O(Lnlogn)O(Ln \log n); for typical L=50-1000L = 50\text{-}1000 this is two to three orders of magnitude faster than full OT. The dimension-free O(n1/2)O(n^{-1/2}) sample complexity of sliced-Wasserstein (vs the O(n1/d)O(n^{-1/d}) curse of full OT33) is its main statistical advantage and the reason it scales gracefully from 1D to 100D. The trade-off is projection bias34: SW is a true metric but only equivalent to W2W_2 for Gaussian distributions; on data concentrated on low-dimensional manifolds, projection averages over informative-and-uninformative directions equally and underestimates the true distance.

OT family evolution from 1955 exact-LP solvers to 2023 production-grade flow-matching and Schrödinger-bridge dynamics. Marker shape and colour encode family; the dotted line marks the exact-LP ground truth (zero relative error). Three observations match the table reading: (i) the exact-LP and foundations era achieves zero error but does not scale; (ii) the 2013-2018 entropic and sliced cluster trades 5-15% bias for two-to-three-orders-of-magnitude speedup, with Sinkhorn divergences recovering 1-3% accuracy via debiasing; (iii) the 2018-2023 neural-OT and flow line, scaled by min-max ICNN potentials and ODE integration of OT-coupled velocity fields, reaches 2-3% accuracy on continuous problems where direct Sinkhorn is infeasible.

Both sliced and entropic methods still assume the source and target live in the same metric space and carry equal total mass. Many practical problems violate both assumptions. Generation 4 relaxes them.

Generation 4 (2017-2022): Unbalanced OT, Gromov-Wasserstein, and Neural Potentials

Unbalanced OT (Chizat et al. 2018) relaxes the equal-mass marginal constraint of classical OT; required whenever source and target have different totals (single-cell snapshots, partial-shape correspondences, class-imbalanced datasets)35. The relaxation replaces hard marginals with soft KL or TV penalties:

minγ0cdγ+τ1KL(γ1μ)+τ2KL(γ2ν)\begin{equation}\label{eq:uot-rw} \min_{\gamma \geq 0} \int c\,d\gamma + \tau_1 \mathrm{KL}(\gamma_1 \mid \mu) + \tau_2 \mathrm{KL}(\gamma_2 \mid \nu) \end{equation}

which admits a scaling iteration analogous to Sinkhorn. Gromov-Wasserstein (Mémoli 2011) handles the dual difficulty: source and target live in different metric spaces (graphs of different sizes, point clouds in different ambient dimensions), so there is no shared cost c(x,y)c(x,y); GW instead matches pair-distances via a quadratic objective minγijkl|dikXdjlY|2γijγkl\min_\gamma \sum_{ijkl} |d^X_{ik} - d^Y_{jl}|^2 \gamma_{ij}\gamma_{kl}. The objective is a quadratic assignment problem, NP-hard in general and non-convex in the coupling; entropic GW solvers depend strongly on initialization, with random / spectral / identity warm-starts producing different local optima36.

The unbalanced and GW extensions are not Pareto-dominated by classical OT; they enable problems (unequal-mass biology, graph matching) that classical OT cannot even formulate. But all of the above methods require either a discrete cost matrix or a fixed-support approximation, blocking application to the continuous high-dimensional distributions that arise in generative modeling. Generation 5 abandons the matrix entirely.

Neural-OT methods (Korotin, Selikhanovych, and Burnaev 2023; Seguy et al. 2018; Amos, Xu, and Kolter 2017) parameterize Kantorovich potentials f,gf, g (or the Brenier potential ϕ\phi) as neural networks and solve the dual saddle-point min-max objective by stochastic gradient ascent-descent. Input convex neural networks (ICNNs) (Amos, Xu, and Kolter 2017) enforce the convexity required by Brenier's theorem via non-negative weights and convex non-decreasing activations; this guarantees the gradient T(x)=ϕθ(x)T(x) = \nabla \phi_\theta(x) is a valid Brenier map. The price is the adversarial training instability of all GAN-class objectives37: hyperparameter sensitivity, mode collapse, and oscillation; mitigated by careful learning-rate scheduling, gradient penalties, and warm-starting from Sinkhorn-derived initial maps. JKOnet (Bunne et al. 2022) uses ICNNs to discretize Wasserstein gradient flows via the JKO scheme, learning energy landscapes from population snapshots.

Generation 5 (2022-2024): Flow-Matching, Schrödinger Bridges, and Unification

OT flow-matching and Schrödinger-bridge methods are the 2022-2024 generative-modelling synthesis. Flow-matching trains a velocity field whose ODE integration transports noise to data; the OT-conditional variant uses mini-batch OT couplings (rather than the diffusion's stochastic noise schedule) to define straighter probability paths, which simpler velocity fields can learn and which integrate in fewer ODE solver steps38. Schrödinger bridges solve the entropy-regularized OT problem in path-measure space via iterative proportional fitting (the dynamic analog of Sinkhorn), interpolating between deterministic OT (ε0\varepsilon \to 0) and pure diffusion (ε\varepsilon \to \infty).

The five lineages (exact-LP, entropic, sliced, unbalanced/GW, and neural/flow) converge on one principle: recover a coupling or transport map that minimizes a transport cost under the marginal and feasibility constraints your compute budget can enforce. Brenier potentials, Schrödinger bridges, and continuous-time normalizing flows are different parameterizations of the same underlying displacement-interpolation theory introduced by Brenier (Brenier 1991) and Benamou-Brenier (Benamou and Brenier 2000) three decades earlier. The figure below maps all five lineages onto a feasibility-vs-fidelity trade-off plane.

Feasibility-vs-fidelity map of the five OT lineages. The horizontal axis measures fidelity cost (0 = exact LP ground truth, 10 = high bias relative to true W2W_2); the vertical axis measures computational speed (0 = slowest, 10 = fastest). The ideal method sits at top-left (zero bias, infinite speed); no method achieves this. The exact-LP family (blue X) occupies the bottom-left: zero error but computationally prohibitive beyond n104n \approx 10^4. The entropic family (orange circles) breaks left by trading 2-15% bias for massive GPU parallelism; Sinkhorn divergences (orange) pull back toward the fidelity axis via debiasing. The sliced family (green diamonds) achieves the highest speed by projecting to 1D, at the cost of projection bias that degrades on non-Gaussian manifolds. The unbalanced and GW families (purple hexagons, brown triangles) are off the main Pareto curve because they address different formulation problems (mass inequality, cross-space matching) rather than trading bias for speed. The neural family (red triangles) accepts slow training (hours) to reach good bias at continuous-measure scale where Sinkhorn cannot form the cost matrix. The flow-matching / Schrödinger bridge family (cyan squares) combines OT-coupled paths with neural ODE integration, reaching 1-2% relative error at inference-time speeds competitive with entropic OT. The convergence of all five lineages onto the upper-left region (2-3% error, GPU-feasible) is the defining achievement of the 2022-2024 period.

References

Agrawal, Akshay, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, and Zico Kolter. 2019. "Differentiable Convex Optimization Layers." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.1910.12430.
Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savaré. 2008. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. 2nd ed. Birkhäuser. https://doi.org/10.1007/978-3-7643-8722-8.
Amos, Brandon, and J. Zico Kolter. 2017. "OptNet: Differentiable Optimization as a Layer in Neural Networks." In International Conference on Machine Learning (ICML). https://doi.org/10.48550/arXiv.1703.00443.
Amos, Brandon, Ivan Dario Jimenez Rodriguez, Jacob Sacks, Byron Boots, and J. Zico Kolter. 2018. "Differentiable MPC for End-to-End Planning and Control." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.1810.13400.
Amos, Brandon, Lei Xu, and J. Zico Kolter. 2017. "Input Convex Neural Networks." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.1609.07152.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. "Wasserstein GAN." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.1701.07875.
Barratt, Shane. 2019. "On the Differentiability of the Solution to Convex Optimization Problems." arXiv. https://doi.org/10.48550/arXiv.1804.05098.
Benamou, Jean-David, and Yann Brenier. 2000. "A Computational Fluid Mechanics Solution to the Monge-Kantorovich Mass Transfer Problem." Numerische Mathematik 84 (3): 375-93. https://doi.org/10.1007/s002110050002.
Berthet, Quentin, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, and Francis Bach. 2020. "Learning with Differentiable Perturbed Optimizers." In Advances in Neural Information Processing Systems (NeurIPS). Vol. 33. http://arxiv.org/abs/2002.08676.
Blondel, Mathieu, André F. T. Martins, and Vlad Niculae. 2020. "Learning with Fenchel-Young Losses." https://doi.org/10.48550/arXiv.1901.02324.
Brenier, Yann. 1991. "Polar Factorization and Monotone Rearrangement of Vector‐valued Functions." Communications on Pure and Applied Mathematics 44 (4): 375-417. https://doi.org/10.1002/cpa.3160440402.
Brochu, Eric, Vlad M. Cora, and Nando de Freitas. 2010. "A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning." arXiv. https://doi.org/10.48550/arXiv.1012.2599.
Brochu, Eric, Matthew W. Hoffman, and Nando de Freitas. 2011. "Portfolio Allocation for Bayesian Optimization." arXiv. https://doi.org/10.48550/arXiv.1009.5419.
Bunne, Charlotte, Laetitia Meng-Papaxanthos, Andreas Krause, and Marco Cuturi. 2022. "Proximal Optimal Transport Modeling of Population Dynamics." https://doi.org/10.48550/arXiv.2106.06345.
Chizat, Lenaic, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. 2018. "Unbalanced Optimal Transport: Dynamic and Kantorovich Formulation." Journal of Functional Analysis, February. https://doi.org/10.48550/arXiv.1508.05216.
Courty, Nicolas, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. 2017. "Optimal Transport for Domain Adaptation." IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), June. https://doi.org/10.48550/arXiv.1507.00504.
Cuturi, Marco. 2013. "Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.1306.0895.
Cuturi, Marco, and Arnaud Doucet. 2014. "Fast Computation of Wasserstein Barycenters." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.1310.4375.
Domke, Justin. 2010. "Implicit Differentiation by Perturbation." In Advances in Neural Information Processing Systems. Vol. 23. Curran Associates, Inc. https://papers.nips.cc/paper_files/paper/2010/hash/6ecbdd6ec859d284dc13885a37ce8d81-Abstract.html.
Duvenaud, David, Hannes Nickisch, and Carl Edward Rasmussen. 2011. "Additive Gaussian Processes." arXiv. https://doi.org/10.48550/arXiv.1112.4394.
Dvurechensky, Pavel, Alexander Gasnikov, and Alexey Kroshnin. 2018. "Computational Optimal Transport: Complexity by Accelerated Gradient Descent Is Better Than by Sinkhorn's Algorithm." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.1802.04367.
Elmachtoub, Adam N., and Paul Grigas. 2022. "Smart "Predict, Then Optimize"." Management Science, November. https://doi.org/10.48550/arXiv.1710.08005.
Eriksson, David, and Martin Jankowiak. 2021. "High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces." arXiv. https://doi.org/10.48550/arXiv.2103.00349.
Eriksson, David, Michael Pearce, Jacob Gardner, Ryan D Turner, and Matthias Poloczek. 2019. "Scalable Global Optimization via Local Bayesian Optimization."
Falkner, Stefan, Aaron Klein, and Frank Hutter. 2018. "BOHB: Robust and Efficient Hyperparameter Optimization at Scale." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.1807.01774.
Feurer, Matthias, Benjamin Letham, Frank Hutter, and Eytan Bakshy. 2022. "Practical Transfer Learning for Bayesian Optimization." Transactions on Machine Learning Research (TMLR), October. https://doi.org/10.48550/arXiv.1802.02219.
Franceschi, Luca, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimilano Pontil. 2018. "Bilevel Programming for Hyperparameter Optimization and Meta-Learning." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.1806.04910.
Frazier, Peter I. 2018. "A Tutorial on Bayesian Optimization." arXiv. https://doi.org/10.48550/arXiv.1807.02811.
Gangbo, Wilfrid, and Robert J. McCann. 1996. "The Geometry of Optimal Transportation." Acta Mathematica 177 (2): 113-61. https://doi.org/10.1007/BF02392620.
Gardner, Jacob R., Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew Gordon Wilson. 2018. "GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.1809.11165.
Genevay, Aude, Marco Cuturi, Gabriel Peyré, and Francis Bach. 2016. "Stochastic Optimization for Large-Scale Optimal Transport." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.1605.08527.
Genevay, Aude, Gabriel Peyré, and Marco Cuturi. 2018. "Learning Generative Models with Sinkhorn Divergences." In International Conference on Artificial Intelligence and Statistics (AISTATS). arXiv. https://doi.org/10.48550/arXiv.1706.00292.
Gould, Stephen, Richard Hartley, and Dylan Campbell. 2021. "Deep Declarative Networks: A New Hope." IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-1. https://doi.org/10.1109/TPAMI.2021.3059462.
Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. "Improved Training of Wasserstein GANs." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.1704.00028.
Ha, Huong, Santu Rana, Sunil Gupta, Thanh Nguyen, Hung Tran-The, and Svetha Venkatesh. 2019. "Bayesian Optimization with Unknown Search Space." arXiv. https://doi.org/10.48550/arXiv.1910.13092.
Hernández-Lobato, José Miguel, Michael A. Gelbart, Ryan P. Adams, Matthew W. Hoffman, and Zoubin Ghahramani. 2016. "A General Framework for Constrained Bayesian Optimization Using Information-Based Search." Journal of Machine Learning Research (JMLR), September. https://doi.org/10.48550/arXiv.1511.09422.
Hernández-Lobato, José Miguel, Matthew W. Hoffman, and Zoubin Ghahramani. 2014. "Predictive Entropy Search for Efficient Global Optimization of Black-Box Functions." arXiv. https://doi.org/10.48550/arXiv.1406.2541.
Jones, Donald R., Matthias Schonlau, and William J. Welch. 1998. "Efficient Global Optimization of Expensive Black-Box Functions." Journal of Global Optimization 13 (4): 455-92. https://doi.org/10.1023/A:1008306431147.
Kandasamy, Kirthevasan, Gautam Dasarathy, Jeff Schneider, and Barnabas Poczos. 2017. "Multi-Fidelity Bayesian Optimisation with Continuous Approximations." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.1703.06240.
Kandasamy, Kirthevasan, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. 2018. "Neural Architecture Search with Bayesian Optimisation and Optimal Transport." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.1802.07191.
Klamkin, Michael, Mathieu Tanneau, and Pascal Van Hentenryck. 2025. "Dual Interior Point Optimization Learning." arXiv. https://doi.org/10.48550/arXiv.2402.02596.
Kolouri, Soheil, Phillip E. Pope, Charles E. Martin, and Gustavo K. Rohde. 2018. "Sliced-Wasserstein Autoencoder: An Embarrassingly Simple Generative Model." arXiv. https://doi.org/10.48550/arXiv.1804.01947.
Kool, Wouter, Herke van Hoof, and Max Welling. 2019. "Attention, Learn to Solve Routing Problems!" In International Conference on Learning Representations (ICLR). arXiv. https://doi.org/10.48550/arXiv.1803.08475.
Korotin, Alexander, Daniil Selikhanovych, and Evgeny Burnaev. 2023. "Neural Optimal Transport." In International Conference on Learning Representations (ICLR). arXiv. https://doi.org/10.48550/arXiv.2201.12220.
Kotary, James, Ferdinando Fioretto, and Pascal Van Hentenryck. 2021. "Learning Hard Optimization Problems: A Data Generation Perspective." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.2106.02601.
Le, Tam, Makoto Yamada, Kenji Fukumizu, and Marco Cuturi. 2019. "Tree-Sliced Variants of Wasserstein Distances." In Advances in Neural Information Processing Systems (NeurIPS). Vol. 32. http://arxiv.org/abs/1902.00342.
Lee, Kwonjoon, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. 2019. "Meta-Learning With Differentiable Convex Optimization." In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10649-57. Long Beach, CA, USA: IEEE. https://doi.org/10.1109/CVPR.2019.01091.
Letham, Benjamin, and Eytan Bakshy. 2019. "Bayesian Optimization for Policy Search via Online-Offline Experimentation." Annals of Applied Statistics.
Li, Shibo, Wei Xing, Mike Kirby, and Shandian Zhe. 2020. "Multi-Fidelity Bayesian Optimization via Deep Neural Networks." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.2007.03117.
Lorraine, Jonathan, Paul Vicol, and David Duvenaud. 2020. "Optimizing Millions of Hyperparameters by Implicit Differentiation." In International Conference on Artificial Intelligence and Statistics (AISTATS). arXiv. https://doi.org/10.48550/arXiv.1911.02590.
Matthews, Alexander G. de G., James Hensman, Richard E. Turner, and Zoubin Ghahramani. 2016. "On Sparse Variational Methods and the Kullback-Leibler Divergence Between Stochastic Processes." In International Conference on Artificial Intelligence and Statistics (AISTATS). arXiv. https://doi.org/10.48550/arXiv.1504.07027.
Mémoli, Facundo. 2011. "Gromov - Wasserstein Distances and the Metric Approach to Object Matching." Foundations of Computational Mathematics 11 (4): 417-87. https://doi.org/10.1007/s10208-011-9093-5.
Mensch, Arthur, and Mathieu Blondel. 2018. "Differentiable Dynamic Programming for Structured Prediction and Attention." In International Conference on Machine Learning (ICML). https://doi.org/10.48550/arXiv.1802.03676.
Pan, Jianming, Zeqi Ye, Xiao Yang, Xu Yang, Weiqing Liu, Lewen Wang, and Jiang Bian. 2024. "BPQP: A Differentiable Convex Optimization Framework for Efficient End-to-End Learning." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.2411.19285.
Peyré, Gabriel, and Marco Cuturi. 2020. Computational Optimal Transport. Foundations and Trends in Machine Learning. arXiv. https://doi.org/10.48550/arXiv.1803.00567.
Prates, Marcelo O. R., Pedro H. C. Avelar, Henrique Lemos, Luis Lamb, and Moshe Vardi. 2019. "Learning to Solve NP-Complete Problems - A Graph Neural Network for Decision TSP." In AAAI Conference on Artificial Intelligence. arXiv. https://doi.org/10.48550/arXiv.1809.02721.
Rachev, S. T. 1985. "The Monge - Kantorovich Mass Transference Problem and Its Stochastic Applications." Theory of Probability & Its Applications, July. https://doi.org/10.1137/1129093.
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press.
Sander, Michaël E., Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. 2022. "Sinkformers: Transformers with Doubly Stochastic Attention." https://doi.org/10.48550/arXiv.2110.11773.
Santambrogio, Filippo. 2014. "Introduction to Optimal Transport Theory." In Optimal Transport: Theory and Applications (London Math. Soc. Lecture Note Series 413). arXiv. https://doi.org/10.48550/arXiv.1009.3856.
- - - . 2015. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Vol. 87. Progress in Nonlinear Differential Equations and Their Applications. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-20828-2.
Seguy, Vivien, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. 2018. "Large-Scale Optimal Transport and Mapping Estimation." In International Conference on Learning Representations (ICLR). arXiv. https://doi.org/10.48550/arXiv.1711.02283.
Shahriari, Bobak, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. 2016. "Taking the Human Out of the Loop: A Review of Bayesian Optimization." Proceedings of the IEEE 104 (1): 148-75. https://doi.org/10.1109/JPROC.2015.2494218.
Snelson, Edward, and Zoubin Ghahramani. 2006. "Sparse Gaussian Processes Using Pseudo-Inputs." In Advances in Neural Information Processing Systems (NeurIPS).
Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. 2012. "Practical Bayesian Optimization of Machine Learning Algorithms." arXiv. https://doi.org/10.48550/arXiv.1206.2944.
Srinivas, Niranjan, Andreas Krause, Sham M. Kakade, and Matthias Seeger. 2012. "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design." IEEE Transactions on Information Theory 58 (5): 3250-65. https://doi.org/10.1109/TIT.2011.2182033.
Stanton, Samuel, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, and Andrew Gordon Wilson. 2022. "Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders." arXiv. https://doi.org/10.48550/arXiv.2203.12742.
Sui, Yanan, Alkis Gotovos, Joel W Burdick, and Andreas Krause. 2015. "Safe Exploration for Optimization with Gaussian Processes." In International Conference on Machine Learning (ICML).
Swersky, Kevin, Jasper Snoek, and Ryan P Adams. 2013. "Multi-Task Bayesian Optimization." In Advances in Neural Information Processing Systems (NeurIPS).
Swersky, Kevin, Jasper Snoek, and Ryan Prescott Adams. 2014. "Freeze-Thaw Bayesian Optimization." arXiv. https://doi.org/10.48550/arXiv.1406.3896.
Tang, Bo, and Elias B. Khalil. 2024. "CaVE: A Cone-Aligned Approach for Fast Predict-Then-Optimize with Binary Linear Programs." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.2312.07718.
Titsias, Michalis K. 2009. "Variational Learning of Inducing Variables in Sparse Gaussian Processes." In International Conference on Artificial Intelligence and Statistics (AISTATS).
Villani, Cédric. 2003. Topics in Optimal Transportation. Vol. 58. Graduate Studies in Mathematics. American Mathematical Society.
- - - . 2009. Optimal Transport: Old and New. Springer. https://doi.org/10.1007/978-3-540-71050-9.
Vlastelica, Marin, Anselm Paulus, Vít Musil, Georg Martius, and Michal Rolínek. 2020. "Differentiation of Blackbox Combinatorial Solvers." In International Conference on Learning Representations (ICLR). arXiv. https://doi.org/10.48550/arXiv.1912.02175.
Wang, Jialei, Scott C. Clark, Eric Liu, and Peter I. Frazier. 2020. "Parallel Bayesian Global Optimization of Expensive Functions." Operations Research, May. https://doi.org/10.48550/arXiv.1602.05149.
Wang, Po-Wei, Priya L. Donti, Bryan Wilder, and Zico Kolter. 2019. "SATNet: Bridging Deep Learning and Logical Reasoning Using a Differentiable Satisfiability Solver." In International Conference on Machine Learning (ICML), 97:6545-54. Proceedings of Machine Learning Research. PMLR. http://arxiv.org/abs/1905.12149.
Wang, Ziyu, Masrour Zoghi, Frank Hutter, David Matheson, and Nando de Freitas. 2016. "Bayesian Optimization in High Dimensions via Random Embeddings." Journal of Artificial Intelligence Research (JAIR).

  1. Decision-vs-prediction loss mismatch: minimum MSE on parameters is not minimum regret on decisions; errors that cross a constraint or objective boundary are arbitrarily more costly than errors that don't.↩︎

  2. Active-set discontinuity: the solver's argmax is piecewise-linear in its parameters with jump discontinuities at active-set transitions; a small input perturbation can flip which constraints bind, producing infinite local Lipschitz constants that defeat first-order training.↩︎

  3. Ill-conditioned KKT system: at the active-set boundary the KKT matrix becomes singular and its condition number diverges; gradient noise then dominates the signal and training oscillates without explicit regularization (typically a small ϵI\epsilon I on the slack block).↩︎

  4. Implicit-differentiation cost: each backward pass solves an O(n2)O(n^2)-O(n3)O(n^3) linear system in the inner-problem variables, so wrapping an LP/QP solver inside a training loop adds 5-50x wall-clock overhead per step relative to plain autograd.↩︎

  5. Decision-vs-prediction loss mismatch: minimum MSE on parameters is not minimum regret on decisions; errors that cross a constraint or objective boundary are arbitrarily more costly than errors that don't.↩︎

  6. Non-differentiable solver: branch-and-bound, simplex, and SAT solvers are piecewise-constant in their parameters, so naive autograd returns zero gradients; smoothing, blackbox finite differences, or perturbed-MAP estimators are required.↩︎

  7. Integer-relaxation gap: continuous relaxations (LP-relaxation, Sinkhorn, perturbed argmax) have a gap to the true integer optimum that does not shrink with predictor accuracy; the learned solution can be near-optimal in relaxation yet infeasible after rounding.↩︎

  8. Constraint violation: an unconstrained NN output may violate the feasible set; projection layers, penalty terms, and dual updates each trade feasibility against optimality and gradient quality.↩︎

  9. Scale mismatch between predictor and solver: the predictor's output scale (network logits) is rarely the solver's parameter scale (cost coefficients in the same units as the objective); without per-feature standardization or a learned output gain, gradients through the solver are dominated by the largest-magnitude coordinate and training stalls.↩︎

  10. Failure to exploit problem structure: a generic OptNet/CvxpyLayer ignores LP-totally-unimodular structure, network-flow decompositions, or block-diagonal Hessians that classical solvers exploit, so neural-augmented pipelines often run slower than tuned commercial solvers on structured instances despite better worst-case generalization.↩︎

  11. Implicit-differentiation cost: each backward pass solves an O(n2)O(n^2)-O(n3)O(n^3) linear system in the inner-problem variables, so wrapping an LP/QP solver inside a training loop adds 5-50x wall-clock overhead per step relative to plain autograd.↩︎

  12. Ill-conditioned KKT system: at the active-set boundary the KKT matrix becomes singular and its condition number diverges; gradient noise then dominates the signal and training oscillates without explicit regularization (typically a small ϵI\epsilon I on the slack block).↩︎

  13. Implicit-differentiation cost: each backward pass solves an O(n2)O(n^2)-O(n3)O(n^3) linear system in the inner-problem variables, so wrapping an LP/QP solver inside a training loop adds 5-50x wall-clock overhead per step relative to plain autograd.↩︎

  14. Failure to exploit problem structure: a generic OptNet/CvxpyLayer ignores LP-totally-unimodular structure, network-flow decompositions, or block-diagonal Hessians that classical solvers exploit, so neural-augmented pipelines often run slower than tuned commercial solvers on structured instances despite better worst-case generalization.↩︎

  15. Non-differentiable solver: branch-and-bound, simplex, and SAT solvers are piecewise-constant in their parameters, so naive autograd returns zero gradients; smoothing, blackbox finite differences, or perturbed-MAP estimators are required.↩︎

  16. Decision-vs-prediction loss mismatch: minimum MSE on parameters is not minimum regret on decisions; errors that cross a constraint or objective boundary are arbitrarily more costly than errors that don't.↩︎

  17. Integer-relaxation gap: continuous relaxations (LP-relaxation, Sinkhorn, perturbed argmax) have a gap to the true integer optimum that does not shrink with predictor accuracy; the learned solution can be near-optimal in relaxation yet infeasible after rounding.↩︎

  18. Non-differentiable solver: branch-and-bound, simplex, and SAT solvers are piecewise-constant in their parameters, so naive autograd returns zero gradients; smoothing, blackbox finite differences, or perturbed-MAP estimators are required.↩︎

  19. Integer-relaxation gap: continuous relaxations (LP-relaxation, Sinkhorn, perturbed argmax) have a gap to the true integer optimum that does not shrink with predictor accuracy; the learned solution can be near-optimal in relaxation yet infeasible after rounding.↩︎

  20. Ill-conditioned KKT system: at the active-set boundary the KKT matrix becomes singular and its condition number diverges; gradient noise then dominates the signal and training oscillates without explicit regularization (typically a small ϵI\epsilon I on the slack block).↩︎

  21. Constraint violation: an unconstrained NN output may violate the feasible set; projection layers, penalty terms, and dual updates each trade feasibility against optimality and gradient quality.↩︎

  22. Cubic exact-LP cost: classical network-simplex / Hungarian algorithms solve the discrete Kantorovich LP in O(n3logn)O(n^3 \log n) time, prohibitive beyond n104n \approx 10^4 and the principal motivation for entropic regularization.↩︎

  23. Entropy regularization bias: the Sinkhorn solution is not the true Wasserstein-2 distance; Wε(μ,μ)>0W_\varepsilon(\mu,\mu) > 0 for non-Dirac μ\mu, and the entropic plan is "blurred" by an amount controlled by ε\varepsilon; Sinkhorn divergences correct this at the cost of three solver runs.↩︎

  24. Curse of dimensionality: empirical OT estimation rate is 𝔼[Wp(μ̂n,μ)]=O(n1/d)\mathbb{E}[W_p(\hat\mu_n, \mu)] = O(n^{-1/d}) in dd dimensions, which is catastrophic for image/text data; sliced and entropic estimators trade bias for dimension-free O(n1/2)O(n^{-1/2}) rates.↩︎

  25. Unbalanced-mass mismatch: classical OT requires equal source and target mass, an unrealistic constraint when comparing single-cell snapshots, partial-shape correspondences, or class-imbalanced datasets; relaxed marginals via KL or TV penalties are required.↩︎

  26. Non-convex Gromov-Wasserstein: the GW objective is a quadratic assignment problem that is non-convex in the coupling; local minima are pervasive and entropic GW solvers depend strongly on initialization (random vs spectral vs identity).↩︎

  27. Dual-potential training instability: neural-OT solvers (Korotin, Selikhanovych, and Burnaev 2023) parameterize Kantorovich potentials f,gf, g as competing networks; the resulting saddle-point optimization inherits the adversarial dynamics of GANs (mode collapse, oscillation, hyperparameter sensitivity).↩︎

  28. Sample complexity: an accurate n×nn \times n cost matrix needs O(n)O(n) samples per distribution and O(n2)O(n^2) cost evaluations, dominating compute budgets when the cost (e.g., embedding distance) is itself expensive; mini-batch OT trades bias for cost.↩︎

  29. Sliced-projection bias: 1D-projection-based estimators average univariate Wasserstein distances over directions of Sd1S^{d-1}; the resulting metric is weaker than WpW_p (only equivalent on Gaussians) and loses sharpness on distributions concentrated on low-dimensional manifolds.↩︎

  30. Non-differentiability of the cost surface: exact OT plans are sparse (2n1\le 2n-1 nonzeros) and discrete; their gradient with respect to inputs is undefined at vertex changes; entropic OT smooths this at the cost of bias, and structured loss functions (Sinkhorn divergence, Sliced) are differentiable but no longer the true Wasserstein.↩︎

  31. Cubic exact-LP cost: classical network-simplex / Hungarian algorithms solve the discrete Kantorovich LP in O(n3logn)O(n^3 \log n) time, prohibitive beyond n104n \approx 10^4 and the principal motivation for entropic regularization.↩︎

  32. Entropy regularization bias: the Sinkhorn solution is not the true Wasserstein-2 distance; Wε(μ,μ)>0W_\varepsilon(\mu,\mu) > 0 for non-Dirac μ\mu, and the entropic plan is "blurred" by an amount controlled by ε\varepsilon; Sinkhorn divergences correct this at the cost of three solver runs.↩︎

  33. Curse of dimensionality: empirical OT estimation rate is 𝔼[Wp(μ̂n,μ)]=O(n1/d)\mathbb{E}[W_p(\hat\mu_n, \mu)] = O(n^{-1/d}) in dd dimensions, which is catastrophic for image/text data; sliced and entropic estimators trade bias for dimension-free O(n1/2)O(n^{-1/2}) rates.↩︎

  34. Sliced-projection bias: 1D-projection-based estimators average univariate Wasserstein distances over directions of Sd1S^{d-1}; the resulting metric is weaker than WpW_p (only equivalent on Gaussians) and loses sharpness on distributions concentrated on low-dimensional manifolds.↩︎

  35. Unbalanced-mass mismatch: classical OT requires equal source and target mass, an unrealistic constraint when comparing single-cell snapshots, partial-shape correspondences, or class-imbalanced datasets; relaxed marginals via KL or TV penalties are required.↩︎

  36. Non-convex Gromov-Wasserstein: the GW objective is a quadratic assignment problem that is non-convex in the coupling; local minima are pervasive and entropic GW solvers depend strongly on initialization (random vs spectral vs identity).↩︎

  37. Dual-potential training instability: neural-OT solvers (Korotin, Selikhanovych, and Burnaev 2023) parameterize Kantorovich potentials f,gf, g as competing networks; the resulting saddle-point optimization inherits the adversarial dynamics of GANs (mode collapse, oscillation, hyperparameter sensitivity).↩︎

  38. Flow-step cost: OT-conditional flow matching trains a velocity field whose ODE integration at inference time still costs 10-50\sim 10\text{-}50 function evaluations; the OT path is straighter than the diffusion path but not free, and the mini-batch OT pre-computation adds per-step O(m2)O(m^2) overhead.↩︎