Generative Models - Adnan Harun Dogan

Autoregressive Models for Visual Generation

Autoregressive visual generation evolved through four overlapping waves: (i) pixel-level models that factorise the image distribution as a product of conditionals over raw RGB values (PixelRNN (A. van den Oord, Kalchbrenner, and Kavukcuoglu 2016a), PixelCNN (A. van den Oord et al. 2016), PixelCNN++ (Salimans et al. 2017)); (ii) discrete tokenisation that compresses pixels into learned codebooks before autoregression (VQ-VAE (A. van den Oord, Vinyals, and Kavukcuoglu 2017), VQ-VAE-2 (Razavi, Oord, and Vinyals 2019), RQ-VAE (D. Lee et al. 2022)); (iii) transformer-based generation that ports GPT-style decoders to image and text-to-image generation (iGPT (M. Chen et al. 2020), DALL-E (Ramesh et al. 2021), Parti (Yu et al. 2022), LlamaGen (Sun et al. 2024)); and (iv) alternative orderings and parallel decoding that break the strict raster-scan dependency (MaskGIT (Chang et al. 2022), MAGE (T. Li et al. 2023), Muse (Chang et al. 2023), ARDM (Hoogeboom et al. 2022), VAR (Tian et al. 2024), Infinity (Han et al. 2025), InfinityStar (J. Liu et al. 2025), VARGPT (Zhuang et al. 2025)). The third wave inherits the encoder-decoder sequence-modelling architecture from neural machine translation (Sutskever's sequence-to-sequence formulation (Sutskever, Vinyals, and Le 2014) and Bahdanau's content-based attention (Bahdanau, Cho, and Bengio 2015)), where the token-by-token conditional factorisation was first stress-tested at scale. Video extends these designs along the temporal axis (VideoGPT (Yan et al. 2021), TATS (Ge et al. 2022), Phenaki (Villegas et al. 2023), MCVD (Voleti, Jolicoeur-Martineau, and Pal 2022)), and scene-conditioned generation injects layout priors (Make-A-Scene (Gafni et al. 2022), Semantic-Diffusion (Weilun Wang et al. 2023)). Theoretical scaffolding around scale and substrate is provided by autoregressive scaling laws (Henighan et al. 2020), ViT scaling (Zhai et al. 2022), and hardware-lottery analysis (Hooker 2020).

The state-of-the-art landscape on standard image-generation benchmarks shows AR models reaching parity with and in places exceeding diffusion. The leaderboard below collates the headline numbers (ImageNet-256 FID for class-conditional, zero-shot MS-COCO FID for text-to-image where applicable; tokenizer column reports the discrete code source; latency is per-image at the listed resolution; "Cite" anchors to the in-chapter section).

Autoregressive image-generation leaderboard. ImageNet-256 FID is class-conditional; COCO-FID is zero-shot at 256 $\times$ 256; latency is wall-clock seconds per image at the listed deployment resolution. Tokenizer column lists the discrete code source. Empty cells denote not-reported or not-applicable. Models tagged "masked" use parallel iterative decoding; "next-scale" use coarse-to-fine scale autoregression; "raster-AR" use sequential next-token. The "first AR model to beat diffusion on ImageNet" milestone is VAR-d30 at FID 1.73, edging DiT-XL/2's 2.27; the comparison is not parameter-matched, since VAR-d30 has 2.0B parameters against DiT-XL/2's 675M.
Method	Year	Params	Tokenizer	Decoding	ImageNet-FID	COCO-FID	Latency
PixelRNN (A. van den Oord, Kalchbrenner, and Kavukcuoglu 2016a)	2016	~50M	Raw 8-bit pixel	raster-AR	3.00 bpd		very slow
PixelCNN (A. van den Oord et al. 2016)	2016	~80M	Raw 8-bit pixel	raster-AR	3.03 bpd		slow
PixelCNN++ (Salimans et al. 2017)	2017	~95M	Raw + logistic mix	raster-AR	2.92 bpd		slow
VQ-VAE-2 (Razavi, Oord, and Vinyals 2019)	2019	~13B*	Hierarchical VQ	raster-AR	31.11		minutes
iGPT-L (M. Chen et al. 2020)	2020	1.4B	k-means 9-bit pixel	raster-AR	36.4		minutes
DALL-E (Ramesh et al. 2021)	2021	12B	dVAE 8192	raster-AR		~17	~30s/64 cands
MaskGIT (Chang et al. 2022)	2022	227M	VQ-GAN 1024	masked	6.18		~0.5s
Make-A-Scene (Gafni et al. 2022)	2022	4B	Triple-VQ	raster-AR		11.84	~10s
Parti-20B (Yu et al. 2022)	2022	20B	ViT-VQGAN 8192	raster-AR		7.23	~15s
MAGE-L (T. Li et al. 2023)	2023	300M	VQ-GAN 1024	masked	6.93		~0.5s
Muse (Chang et al. 2023)	2023	3B+3B	ViT-VQGAN 8192	masked+SR		7.88	1.3s
LlamaGen-3B (Sun et al. 2024)	2024	3.1B	VQ-GAN 16384	raster-AR	2.18	7.2	~2s w/ vLLM
VAR-d30 (Tian et al. 2024)	2024	2.0B	Multi-scale RQ-VAE	next-scale	1.73		~0.3s
Infinity-2B (Han et al. 2025)	2025	2B	LFQ 32-bit (no book)	next-scale		0.69 GE	0.8s @ 1024
InfinityStar (J. Liu et al. 2025)	2025	8B	3D LFQ	next-scale		83.74 VB	40s/5s 720p
VARGPT (Zhuang et al. 2025)	2025	7B	VAR-style + CLIP	next-scale		14.8	~0.3s+text

(*) VQ-VAE-2's 13B figure aggregates the top-prior 48-layer Transformer (~12B) with the bottom-prior 20-layer PixelCNN (~1B); per-paper accounting practice varies. "GE" denotes GenEval composite score (text-to-image alignment, higher is better, 0-1) and "VB" denotes VBench composite (text-to-video, higher is better, 0-100); these benchmarks are not directly comparable to FID. Pixel-level rows report bits-per-dimension instead of FID because Inception-v3 features are uninformative at 32 $\times$ 32 resolution.
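
The bits-per-dimension figures in the pixel-level rows are a unit conversion of the model's negative log-likelihood. The sketch below is illustrative only: it assumes the likelihood is reported in nats per image and that a 32 $\times$ 32 RGB image has 3072 dimensions; the function name is hypothetical.

```python
import math


def nats_to_bits_per_dim(nll_nats: float, num_dims: int) -> float:
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))


# A 32x32 RGB image has 32 * 32 * 3 = 3072 dimensions (assumed resolution).
NUM_DIMS = 32 * 32 * 3

# Build the nats value that corresponds to 3.00 bpd, then convert it back.
example_nll = 3.00 * NUM_DIMS * math.log(2.0)
example_bpd = nats_to_bits_per_dim(example_nll, NUM_DIMS)
print(f"{example_nll:.1f} nats per image = {example_bpd:.2f} bits per dimension")
```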

Deep Generative Models

Generative modeling aims to learn the underlying data distribution $p_{\text{data}}(x)$ and generate new samples from it. Classical approaches include Gaussian mixture models (Reynolds 2009), kernel density estimation, and graphical models (Bayesian networks, Markov random fields). These methods scale poorly to high-dimensional data: Gaussian mixtures require exponentially many components, kernel methods suffer from the curse of dimensionality, and graphical-model inference is intractable for dense dependencies. Deep learning revolutionized this field by enabling learning of complex, high-dimensional distributions through expressive neural-network parameterizations with scalable optimization.

Variational Autoencoders (Kingma and Welling 2014) introduced amortized variational inference with neural networks, enabling efficient approximate posterior inference through the Evidence Lower Bound (ELBO). Extensions include importance-weighted bounds (Burda, Grosse, and Salakhutdinov 2016), disentangled representations (Higgins et al. 2017), and hierarchical architectures (Vahdat and Kautz 2020). The VAE framework connects to classical variational Bayes (Jordan et al. 1999) and neural autoencoders (Hinton and Salakhutdinov 2006), combining the probabilistic principles of the former with the representation-learning flexibility of the latter.

Generative Adversarial Networks (Goodfellow et al. 2014) frame generation as a minimax game between a generator and a discriminator network. Key advances include stable training techniques (Gulrajani et al. 2017), progressive growing (Karras et al. 2018), and style-based generation (Karras, Laine, and Aila 2019). The game-theoretic framing inspired an array of subsequent "adversarial" methods: adversarial training (Madry et al.), adversarial domain adaptation, and adversarial robustness.

Normalizing Flows (Rezende and Mohamed 2015) provide exact likelihood computation through invertible transformations. Notable architectures include Real NVP (Dinh, Sohl-Dickstein, and Bengio 2017), MAF (Papamakarios, Pavlakou, and Murray 2018), Glow (Kingma and Dhariwal 2018), and Neural Spline Flows (Durkan et al. 2019). Flows generalize classical change-of-variables density estimation by using deep neural networks as the transformation, with careful design constraints to maintain invertibility and efficient Jacobian computation.
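
The change-of-variables rule that these architectures rely on can be illustrated in one dimension. The sketch below is a minimal example, assuming a single affine map onto a standard-normal base distribution; the function names are illustrative and do not come from any of the cited papers.

```python
import math


def standard_normal_logpdf(z: float) -> float:
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x: float, shift: float, scale: float) -> float:
    """log p_X(x) for the flow x = shift + scale * z with z ~ N(0, 1)."""
    z = (x - shift) / scale            # inverse transformation
    log_det = -math.log(abs(scale))    # log |dz/dx|, the Jacobian term
    return standard_normal_logpdf(z) + log_det


def gaussian_logpdf(x: float, mean: float, std: float) -> float:
    """Closed-form N(mean, std^2) log-density, used only as a reference."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


flow_value = affine_flow_logpdf(1.3, shift=0.5, scale=2.0)
reference_value = gaussian_logpdf(1.3, mean=0.5, std=2.0)
print(flow_value, reference_value)
```

The two values agree because an affine flow over a Gaussian base is itself Gaussian; deeper flows stack many such invertible layers and sum their log-determinant terms.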

Score-based models (Y. Song and Ermon 2020) learn the gradient of the log-density and generate via Langevin dynamics. The connection to diffusion models (Ho, Jain, and Abbeel 2020) established a unified framework through stochastic differential equations (Y. Song et al. 2021). Score-based generation builds on score matching (Hyvärinen 2005) and denoising autoencoders (Vincent et al. 2008), with the continuous-time formulation drawing on results from stochastic calculus (Anderson's reverse-time diffusion theorem from 1982).
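
Langevin sampling can be sketched on a toy target. The example below is a simplification: it assumes a one-dimensional Gaussian target whose score is known in closed form, which stands in for the learned score network, and the step size and chain length are arbitrary illustrative choices.

```python
import random


def gaussian_score(x: float, mean: float, std: float) -> float:
    """Analytic score d/dx log N(x; mean, std^2); a stand-in for a learned network."""
    return -(x - mean) / (std * std)


def langevin_sample(score_fn, x_init, step_size, num_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + (eps / 2) * score(x) + sqrt(eps) * noise."""
    x = x_init
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * noise
    return x


rng = random.Random(0)
target_mean, target_std = 2.0, 1.0
samples = [
    langevin_sample(
        lambda x: gaussian_score(x, target_mean, target_std),
        x_init=rng.uniform(-5.0, 5.0),
        step_size=0.1,
        num_steps=200,
        rng=rng,
    )
    for _ in range(1000)
]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

With a finite step size the chain's stationary variance is slightly larger than the target's, which is one reason practical samplers anneal the noise level or add a correction step.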

Autoregressive models (PixelRNN (A. van den Oord, Kalchbrenner, and Kavukcuoglu 2016b), PixelCNN (A. van den Oord, Kalchbrenner, et al. 2016), WaveNet (A. van den Oord, Dieleman, et al. 2016), GPT (Radford et al. 2019)) provide a competing paradigm that factorizes the joint via the chain rule and models each conditional with a deep neural network. For discrete data and sequences, autoregressive models are often the natural choice; for continuous high-dimensional data, they have been largely superseded by diffusion models due to sampling speed.
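
The chain-rule factorisation can be made concrete with a toy model over short binary sequences. The sketch below assumes a hand-written conditional in place of a deep network; `cond_prob_one` and its coefficients are hypothetical.

```python
import itertools
import math
import random


def cond_prob_one(prefix) -> float:
    """Toy conditional p(x_i = 1 | x_<i); stands in for a neural network."""
    return 1.0 / (1.0 + math.exp(-(0.3 * sum(prefix) - 0.2)))


def log_joint(x) -> float:
    """log p(x) = sum_i log p(x_i | x_<i), the chain-rule factorisation."""
    total = 0.0
    for i, xi in enumerate(x):
        p1 = cond_prob_one(x[:i])
        total += math.log(p1 if xi == 1 else 1.0 - p1)
    return total


def sample(length: int, rng: random.Random):
    """Ancestral sampling: draw each element given everything drawn so far."""
    x = []
    for _ in range(length):
        x.append(1 if rng.random() < cond_prob_one(x) else 0)
    return x


# The factorised model is normalised by construction.
total_prob = sum(math.exp(log_joint(x)) for x in itertools.product([0, 1], repeat=4))
drawn = sample(4, random.Random(0))
print(total_prob, drawn)
```

Sampling needs one model call per element, which is the sequential cost that makes long pixel or token sequences slow to generate.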

Recent work has focused on unifying these paradigms through the score SDE framework (Y. Song et al. 2021), Flow Matching (Lipman et al. 2023), and the Schrödinger bridge perspective (De Bortoli et al. 2021). The field increasingly views VAE/GAN/flow/diffusion as different instantiations of a common transport-map framework, with architectural and training-objective choices distinguishing them rather than fundamental paradigm differences.

Diffusion Models and Transformer Architectures

The diffusion-model literature spans pixel-space to latent-space, U-Net to transformer, single-model to cascade, and image to video. The leaderboard below anchors the dominant lineage by reported FID at canonical resolutions, parameter count, and inference NFE, so that each prose paragraph that follows has a numerical comparator. The columns isolate the three axes that drive design choices: quality (FID), scale (parameters), and sampling efficiency (NFE). A GAN baseline (BigGAN at 7.4 FID) is included to clarify the diffusion-vs-GAN crossover that ADM established and EDM widened.

Selected diffusion / flow / transformer milestones on standard benchmarks. FID rows mix ImageNet 256×256 (class-conditional), CIFAR-10 (unconditional), and zero-shot MS-COCO (text-to-image) at the resolution called out by each paper; FVD covers UCF-101. NFE = inference function evaluations at the reported FID. Citations point to the original sources cataloged in subsequent sections.
Model	Year	Backbone	Resolution / Task	FID	NFE	Params
BigGAN (GAN baseline)	2018	GAN	ImageNet 256	7.4	1	158M
DDPM (Ho, Jain, and Abbeel 2020)	2020	U-Net	CIFAR-10 uncond	3.17	1000	36M
DDIM (J. Song, Meng, and Ermon 2021)	2021	U-Net	CIFAR-10 (S=50)	4.67	50	36M
Improved DDPM (Nichol and Dhariwal 2021)	2021	U-Net	ImageNet 64 (bpd 2.94)	-	250	100M
ADM-G (Dhariwal and Nichol 2021)	2021	U-Net	ImageNet 256 cls-cond	4.59	250	554M
ADM-G + upsmp (Dhariwal and Nichol 2021)	2021	U-Net casc.	ImageNet 256 cls-cond	3.94	~ 250	~1.0B
EDM (Karras et al. 2022)	2022	U-Net	CIFAR-10 cls-cond	1.79	35	56M
LDM-4-G (Rombach et al. 2022)	2022	U-Net	ImageNet 256 cls-cond	3.60	250	400M
GLIDE (Nichol et al. 2022)	2022	U-Net	MS-COCO zero-shot	12.2	250	3.5B
Imagen (Saharia et al. 2022)	2022	U-Net casc.	MS-COCO zero-shot	7.27	~ 256	~7.6B
DALL-E 2 (Ramesh et al. 2022)	2022	U-Net casc.	MS-COCO zero-shot	10.4	~ 250	4.5B
eDiff-I (Balaji et al. 2023)	2023	U-Net casc.	MS-COCO zero-shot	6.95	~ 250	~9B
DiT-XL/2 (Peebles and Xie 2023)	2023	DiT	ImageNet 256 (cfg=1.5)	2.27	250	675M
U-ViT (Bao et al. 2023)	2023	U-ViT	ImageNet 256 cls-cond	2.29	250	501M
SDXL (Podell et al. 2023)	2023	U-Net	T2I 1024 (PartiPrompt)	n/a	50	2.6B
PixArt- $\alpha$ (J. Chen et al. 2023)	2023	DiT	T2I 1024 (DrawBench)	n/a	20	600M
DALL-E 3 (Betker et al. 2023)	2023	LDM	DrawBench GPT-4V judge	n/a	~ 50	undisc.
Consistency D. (Y. Song et al. 2023)	2023	EDM-init	CIFAR-10 (1 step)	3.55	1	56M
Flow Matching (Lipman et al. 2023)	2023	EDM-net	ImageNet 64 (bpd 3.31)	-	142	62M
SD3 (MM-DiT) (Esser et al. 2024)	2024	MM-DiT	T2I 1024 (GenEval)	n/a	28	8B
SVD (Blattmann, Dockhorn, et al. 2023)	2023	LDM+temp.	UCF-101 (FVD)	~140	25	1.5B

Diffusion models have emerged as the dominant paradigm for generative modeling, surpassing GANs in image quality while offering stable training and mode coverage. The foundational work on Denoising Diffusion Probabilistic Models (DDPM) (Ho, Jain, and Abbeel 2020) demonstrated that iteratively denoising samples from Gaussian noise produces high-quality images, achieving state-of-the-art FID scores on CIFAR-10 (3.17) and competitive results on LSUN.[fn:s2-ddpm-impl] This work established the connection between diffusion models and denoising score matching, providing both theoretical grounding and practical training recipes.

Subsequent work focused on accelerating sampling, which originally required on the order of 1000 steps. Denoising Diffusion Implicit Models (DDIM) (J. Song, Meng, and Ermon 2021) introduced a non-Markovian formulation enabling 10-50× faster sampling; its deterministic variant can be read as discretising an ODE rather than simulating an SDE. Improved DDPM (Nichol and Dhariwal 2021) showed that learning the reverse-process variances, together with a cosine noise schedule, lets sampling use far fewer steps with little loss in quality. DPM-Solver++ (Lu et al. 2022) further improved sampling efficiency through high-order ODE solvers specifically designed for guided diffusion, enabling 15-20 step generation.
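
The deterministic DDIM update can be sketched on scalars. This is an illustrative example, not the authors' code: it assumes a linear beta schedule with 1000 training steps, evenly spaced sampling timesteps, and an oracle noise predictor that replaces the trained network so that the result can be checked exactly.

```python
import math


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def ddim_step(x_t, eps, a_t, a_prev):
    """One deterministic (eta = 0) DDIM update from noise level a_t to a_prev."""
    x0_pred = (x_t - math.sqrt(1.0 - a_t) * eps) / math.sqrt(a_t)
    return math.sqrt(a_prev) * x0_pred + math.sqrt(1.0 - a_prev) * eps


def ddim_sample(x_T, eps_fn, alpha_bar, num_sampling_steps):
    """Run DDIM over an evenly spaced subsequence of the training timesteps."""
    last = len(alpha_bar) - 1
    timesteps = [round(i * last / (num_sampling_steps - 1)) for i in range(num_sampling_steps)][::-1]
    x = x_T
    for k, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the final timestep, jump to the noise-free level alpha_bar = 1.
        a_prev = alpha_bar[timesteps[k + 1]] if k + 1 < len(timesteps) else 1.0
        x = ddim_step(x, eps_fn(x, t), a_t, a_prev)
    return x


alpha_bar = make_alpha_bar()
x0_true, noise = 0.7, -1.3
x_T = math.sqrt(alpha_bar[-1]) * x0_true + math.sqrt(1.0 - alpha_bar[-1]) * noise


def oracle_eps(x, t):
    """Hypothetical perfect noise predictor; a trained network would go here."""
    a = alpha_bar[t]
    return (x - math.sqrt(a) * x0_true) / math.sqrt(1.0 - a)


recovered = ddim_sample(x_T, oracle_eps, alpha_bar, num_sampling_steps=10)
print(recovered)
```

With a perfect predictor, ten steps recover the clean sample; in practice the prediction error of the network determines how few steps can be used.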

Architectural refinement first took place within the U-Net family, before the move to Transformer backbones. Diffusion Models Beat GANs (Dhariwal and Nichol 2021) demonstrated that architectural improvements (attention at multiple resolutions, adaptive group normalization) and classifier guidance enable diffusion models to surpass BigGAN on ImageNet. The EDM framework (Karras et al. 2022) unified the design space, achieving a then state-of-the-art FID of 1.79 on class-conditional CIFAR-10 through careful analysis of noise schedules, preconditioning, and sampling.

The Diffusion Transformer (DiT) (Peebles and Xie 2023) replaced U-Net with a Vision Transformer operating on latent patches, demonstrating clean scaling laws: higher compute (GFLOPs) consistently yields lower FID. DiT-XL/2 achieved 2.27 FID on ImageNet 256×256, establishing transformers as viable diffusion backbones. U-ViT (Bao et al. 2023) showed that treating time, condition, and image patches uniformly as tokens with long skip connections achieves comparable results with simpler architecture.

Text-to-image synthesis scaled these techniques dramatically. Latent Diffusion Models (LDM) (Rombach et al. 2022) moved diffusion to the latent space of a pretrained autoencoder, reducing computation by 4-16× while maintaining quality. GLIDE (Nichol et al. 2022) demonstrated classifier-free guidance outperforms CLIP guidance for photorealism. Imagen (Saharia et al. 2022) showed that scaling the text encoder (T5-XXL) matters more than scaling the image model, achieving breakthrough text-image alignment. DALL-E 2 (Ramesh et al. 2022) used a two-stage approach: CLIP prior generating image embeddings, then diffusion decoder.
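
Classifier-free guidance combines two noise predictions from the same network, one made with the conditioning and one without. The sketch below assumes the convention in which a scale of 1 recovers the conditional prediction (some papers instead write the scale as 1 + w); the function name and the example vectors are hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.10, -0.20, 0.05]   # prediction with the prompt dropped
eps_cond = [0.30, 0.00, -0.15]     # prediction with the prompt supplied

unguided = cfg_combine(eps_uncond, eps_cond, guidance_scale=1.0)
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=3.0)
print(unguided, guided)
```

Scales above 1 push samples further in the direction the condition indicates, which typically trades diversity for prompt fidelity.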

The Stable Diffusion family democratized text-to-image. SDXL (Podell et al. 2023) scaled to a 3× larger U-Net with dual text encoders, achieving quality competitive with closed-source models. PixArt- $\alpha$ (J. Chen et al. 2023) demonstrated efficient training by decomposing training into three stages (pixel dependency, text-image alignment, aesthetic quality), requiring only 10.8% of Stable Diffusion v1.5's training time. Stable Diffusion 3 (Esser et al. 2024) introduced the MM-DiT architecture with separate weights for image and text modalities and a rectified flow formulation.

Video generation extended these techniques temporally. Video Diffusion Models (Ho, Salimans, et al. 2022) added 3D attention for temporal coherence. Imagen Video (Ho, Chan, et al. 2022) used cascaded diffusion with spatial and temporal super-resolution. Video LDM (Blattmann, Rombach, et al. 2023) inserted temporal layers into pretrained image LDMs. Stable Video Diffusion (Blattmann, Dockhorn, et al. 2023) demonstrated the importance of data curation for video quality. AnimateDiff (Guo et al. 2024) showed motion modules can animate any personalized text-to-image model without retraining.

Flow matching (Lipman et al. 2023) provided an alternative training paradigm for continuous normalizing flows. Instead of learning a score function, flow matching directly regresses a velocity field that transports noise to data; the score $\nabla_x \log p_t(x)$ is replaced by a learned velocity $v_\theta(x, t)$ generating a probability path $p_t$ interpolating between noise $p_0 = \mathcal{N}(0, I)$ and data $p_1 = p_{\text{data}}$ . The conditional flow matching (CFM) objective is:

\begin{equation} \mathcal{L}_{\text{CFM}} = \mathbb{E}_{t, x_1, x_t} \|v_\theta(x_t, t) - u_t(x_t | x_1)\|^2 \end{equation}

where $u_t(x_t | x_1)$ is the target velocity of the conditional path. For the linear interpolation $x_t = (1-t)x_0 + t x_1$ the target is simply $u_t = x_1 - x_0$. Compared to score matching, CFM yields straighter trajectories that require fewer integration steps at inference (typically 20-50 NFEs vs. 100-1000 for standard diffusion). The OT variant uses $x_t = (1-(1-\sigma_{\min})t)x_0 + tx_1$ , whose target velocity is $u_t = x_1 - (1-\sigma_{\min})x_0$ , producing nearly straight paths that can be sampled with Euler integration in as few as 5 steps. Consistency models (Y. Song et al. 2023) learn to map any noisy $x_t$ directly to the clean endpoint of its trajectory (written $x_0$ in that paper's convention, where $t = 0$ is the data end, the reverse of the flow-matching convention above) via a self-consistency constraint $f_\theta(x_t, t) = f_\theta(x_{t'}, t')$ for all $t, t'$ on the same trajectory, enabling one-step generation.
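
The CFM objective and Euler sampling can be sketched on scalars. This is a toy illustration under loud assumptions: the data distribution is a single point, so the ideal velocity field is known in closed form and stands in for the trained network $v_\theta$; `SIGMA_MIN`, `DATA_POINT`, and the function names are hypothetical.

```python
import random

SIGMA_MIN = 1e-4   # assumed value of sigma_min
DATA_POINT = 3.0   # toy data distribution: a single point


def ot_path_point(x0, x1, t):
    """x_t = (1 - (1 - sigma_min) t) x_0 + t x_1, with x_0 noise and x_1 data."""
    return (1.0 - (1.0 - SIGMA_MIN) * t) * x0 + t * x1


def ot_target_velocity(x0, x1):
    """Time derivative of the OT path: u_t = x_1 - (1 - sigma_min) x_0."""
    return x1 - (1.0 - SIGMA_MIN) * x0


def cfm_loss(velocity_fn, data, rng, num_samples=1000):
    """Monte Carlo estimate of E ||v(x_t, t) - u_t(x_t | x_1)||^2."""
    total = 0.0
    for _ in range(num_samples):
        x1, x0, t = rng.choice(data), rng.gauss(0.0, 1.0), rng.random()
        x_t = ot_path_point(x0, x1, t)
        total += (velocity_fn(x_t, t) - ot_target_velocity(x0, x1)) ** 2
    return total / num_samples


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        x += dt * velocity_fn(x, k * dt)
    return x


def ideal_velocity(x_t, t):
    """Closed-form optimum for point-mass data; a network would replace this."""
    x0 = (x_t - t * DATA_POINT) / (1.0 - (1.0 - SIGMA_MIN) * t)
    return DATA_POINT - (1.0 - SIGMA_MIN) * x0


rng = random.Random(0)
loss_ideal = cfm_loss(ideal_velocity, [DATA_POINT], rng)
loss_zero = cfm_loss(lambda x, t: 0.0, [DATA_POINT], rng)
generated = euler_sample(ideal_velocity, x0=1.5, num_steps=5)
print(loss_ideal, loss_zero, generated)
```

Because the conditional paths are straight lines, five Euler steps land on the data point up to the residual $\sigma_{\min}$ term; with real data the learned field is only approximately straight, so more steps are usually needed.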

Controllable generation became practical with ControlNet (Zhang, Rao, and Agrawala 2023), which adds spatial conditioning (edges, depth, pose) to pretrained models via trainable copies of encoder blocks connected with zero convolutions. InstructPix2Pix (Brooks, Holynski, and Efros 2023) enabled instruction-based editing by training on synthetic edit pairs. Prompt-to-Prompt (Hertz et al. 2022) showed that manipulating cross-attention maps enables localized editing without retraining.

The field continues advancing rapidly, with Sora ("SORA: Video Generation Models as World Simulators" 2024) demonstrating minute-long coherent video generation, suggesting diffusion transformers can serve as world simulators understanding physical dynamics and 3D consistency.

Vision-Language Models: From Contrastive Learning to Multimodal Assistants

The foundation of modern VLMs was established by CLIP (Radford et al. 2021), which demonstrated that contrastive learning on 400 million image-text pairs enables zero-shot transfer to diverse visual tasks.[fn:s3-clip-impl] CLIP's text tower inherits a long lineage. Distributed word representations trained by contrasting context-word co-occurrences against negative samples were the first widely deployed evidence that geometry in a learned embedding space reflects semantic structure; the word2vec construction (Mikolov et al. 2013) introduced negative sampling as a replacement for hierarchical softmax, together with frequent-word subsampling and phrase representations, making billion-token corpora trainable on a single machine. The InfoNCE loss CLIP uses is essentially the same noise-contrastive estimator generalised to cross-modal positive pairs. ALIGN (Jia et al. 2021) showed that noisy web-scale data can substitute for curation when sufficient scale is achieved. SigLIP (Zhai et al. 2023) introduced a simpler sigmoid loss that scales more efficiently than softmax-based contrastive learning. The contrastive pre-training lineage also includes VirTex (2020, caption-based supervision), ConVIRT (2020, medical domain), and UniCL (2022, unified classification and contrastive learning), with CLIP's influence coming from its combination of scale, simplicity, and public release. Subsequent refinements include OpenCLIP (open reimplementation with documented data), FILIP (fine-grained token-level alignment), and MetaCLIP (careful data curation matching CLIP's original recipe). The longer arc reaches back to Baby Talk (Kulkarni et al. 2011), an early complete pipeline that generated natural-language descriptions of unseen images from hand-engineered object/attribute/preposition detectors fed into a tree-CRF; that paper's central claim that visual content can be linearised into compositional language predates CLIP-style contrastive coupling by a decade and frames much of subsequent VLM benchmarking.
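
The two contrastive objectives mentioned above can be compared on a tiny batch. The sketch below is a simplified illustration, not either paper's implementation: it assumes unit-norm embeddings, a fixed temperature of 0.07 for the softmax loss, and a fixed scale of 10 and bias of -10 for the sigmoid loss, all of which are learnable in practice.

```python
import math


def similarity_matrix(img_emb, txt_emb):
    """Dot products between every image and every text embedding (assumed unit norm)."""
    return [[sum(a * b for a, b in zip(i, t)) for t in txt_emb] for i in img_emb]


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: cross-entropy over rows (image->text) and columns (text->image)."""
    logits = [[s / temperature for s in row] for row in similarity_matrix(img_emb, txt_emb)]
    logits_t = [list(col) for col in zip(*logits)]
    n = len(logits)
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def siglip_loss(img_emb, txt_emb, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss: every (image, text) pair is an independent binary problem."""
    sims = similarity_matrix(img_emb, txt_emb)
    n = len(sims)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += math.log1p(math.exp(-label * (scale * sims[i][j] + bias)))
    return total / n


images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]
texts_shuffled = [[0.0, 1.0], [1.0, 0.0]]

clip_matched = clip_loss(images, texts_matched)
clip_shuffled = clip_loss(images, texts_shuffled)
sig_matched = siglip_loss(images, texts_matched)
sig_shuffled = siglip_loss(images, texts_shuffled)
print(clip_matched, clip_shuffled, sig_matched, sig_shuffled)
```

The softmax loss normalises over the whole batch, whereas the sigmoid loss needs no batch-wide normalisation, which is the property that makes it easier to scale across devices.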

Self-supervised visual learning reached new heights with DINOv2 (Oquab et al. 2024), which produces all-purpose visual features through careful data curation and training at scale. Chinese CLIP (Yang et al. 2023) extended contrastive pre-training to Chinese.

The BLIP series (J. Li et al. 2022, 2023; Xue et al. 2024) introduced bootstrapping approaches that filter noisy captions and enable efficient adaptation of frozen encoders (BLIP, BLIP-2). InstructBLIP (Dai et al. 2023) demonstrated the power of instruction tuning for vision-language models. Parallel efforts in this space include ALBEF (2021, introduced the unified encoder-decoder design with image-text matching and contrastive objectives), OFA (2022, unified multimodal sequence-to-sequence learning), and mPLUG (2022-2023, modular cross-attention with skip connections). The BLIP line's distinguishing contribution was CapFilt-style bootstrapping, which has since become a standard technique across the VLM training pipeline.

Flamingo (Alayrac et al. 2022) pioneered few-shot learning in VLMs through interleaved image-text inputs. LLaVA (H. Liu et al. 2023, 2024; Y. J. Lee et al. 2024) established simple yet effective approaches for visual instruction tuning. CogVLM (Weihan Wang et al. 2024) introduced visual expert modules for deep vision-language fusion, while Qwen-VL (Bai et al. 2023) demonstrated strong multilingual capabilities. Contemporary open-source VLMs that extend or compete with these lines include InternVL (2024, high-resolution scaling), IDEFICS (2023-2024, open Flamingo reproduction), MiniGPT-4/v2 (2023, early LLaVA-style open demonstrator), mPLUG-Owl (2023-2024, modular design with visual abstractor), and PaliGemma (2024, Google's open SigLIP + Gemma combination).

Efficient VLMs for deployment include MobileVLM (Chu et al. 2023), TinyGPT-V (Yuan et al. 2024), and LLaVA-Phi (Zhu et al. 2024); see Efficient Vision-Language Models. Frontier systems like GPT-4V (OpenAI 2023) and Gemini (Team et al. 2023, 2024) represent the state of the art in multimodal understanding. Additional closed frontier systems include Claude 3.5 Sonnet (Anthropic), GPT-4o (OpenAI, successor to GPT-4V with stronger multimodal integration), and Grok-Vision (xAI). The closed-vs-open gap has narrowed substantially: as of late 2024, the best open models (Qwen2-VL-72B, InternVL-2-76B) match or exceed GPT-4V on most academic benchmarks, though closed models retain a lead in qualitative usability, reasoning depth, and safety behavior.

Recent systematic surveys include Zhang et al.'s "Vision-Language Models for Vision Tasks: A Survey" (G. Li et al. 2025) and the MLLM survey by Yin et al., which provide broader coverage complementing the focus here. Reading pathways: for contrastive pre-training, start with CLIP and follow forward through SigLIP; for generative VLMs, follow BLIP -> BLIP-2 -> LLaVA -> LLaVA-NeXT; for frontier systems, read the Gemini and GPT-4V system cards alongside their technical reports.

SoTA Leaderboard

Selected vision-language milestones from CLIP (2021) to frontier multimodal assistants (2024). "Type" labels three families: contrastive (C), generative/captioning (G), and instruction-tuned multimodal (I). Vision-encoder column lists the image tower (ViT sizes follow the standard B/L/H/g/G/E convention; "/14" = patch 14). LLM column gives the language tower for VLMs that wire one in (pure contrastive models without a named text tower show "-"). "Train data" reports the headline pre-training pair count. "IN 0-shot" is ImageNet-1k zero-shot top-1 (where applicable; instruction VLMs are not designed for this task and show "-"). "VQAv2" is test-dev accuracy (zero-shot reported where the paper exposed it; otherwise fine-tuned). "COCO R@1" is image-to-text retrieval on MS-COCO 5K (zero-shot for contrastive models; "-" where the model targets generation only). GQA scores, quoted in the note below the table rather than as a column, use the test-dev split. Citations point to the canonical paper for each model. Numbers reflect the largest configuration each paper released: where smaller siblings exist (e.g., LLaVA-1.5-7B vs 13B), the larger is listed to keep rows comparable.
Method	Year	Type	Vision enc.	LLM	Train data	IN 0-shot	VQAv2	COCO R@1
CLIP ViT-L/14-336 (Radford et al. 2021)	2021	C	ViT-L/14	-	WIT 400M	76.2	-	58.4
ALIGN (Jia et al. 2021)	2021	C	EfficientNet-L	BERT-Large	ALIGN 1.8B	76.4	-	58.6
FLIP ViT-L/14 (Radford et al. 2021)	2023	C	ViT-L/14	-	LAION-400M	74.6	-	54.8
OpenCLIP ViT-G/14 (Radford et al. 2021)	2022	C	ViT-G/14	-	LAION-2B	80.1	-	67.3
EVA-CLIP-E/14+ (Radford et al. 2021)	2023	C	ViT-E/14	-	LAION-2B+	82.0	-	71.1
SigLIP SoViT-400m (Zhai et al. 2023)	2023	C	SoViT-400m/14	-	WebLI 10B	83.0	-	70.6
Chinese-CLIP ViT-H (Yang et al. 2023)	2022	C	ViT-H/14	RoBERTa-wwm	LAION+Wukong	-	-	-
BLIP ViT-L (J. Li et al. 2022)	2022	G	ViT-L/16	BERT-Large	129M	-	78.2	65.1
CoCa-2.1B (Radford et al. 2021)	2022	C+G	ViT-g	dec. 1.1B	ALIGN+JFT-3B	86.3	82.3	66.3
BLIP-2 (FlanT5-XXL) (J. Li et al. 2023)	2023	G	EVA-ViT-g	FlanT5-XXL 11B	129M	-	65.0	-
BEiT-3 (Radford et al. 2021)	2022	G	ViT-g (BEiT-3)	BEiT-3 dec.	35M	-	84.0	76.0
Flamingo-80B (Alayrac et al. 2022)	2022	G+I	NFNet-F6	Chinchilla 70B	M3W 43M+	-	82.0	-
IDEFICS-80B (Alayrac et al. 2022)	2023	G+I	OpenCLIP H/14	LLaMA-65B	OBELICS 141M	-	60.0	-
Florence-2-Large (Radford et al. 2021)	2023	G	DaViT-Large	BART-base dec.	FLD-5B	-	-	-
MiniGPT-4 (H. Liu et al. 2023)	2023	I	EVA-ViT-g	Vicuna-13B	LAION+CC+SBU	-	48.5	-
LLaVA-1.5-13B (H. Liu et al. 2024)	2023	I	CLIP ViT-L-336	Vicuna-13B	1.2M	-	80.0	-
InstructBLIP-13B (Dai et al. 2023)	2023	I	EVA-ViT-g	Vicuna-13B	26 datasets	-	-	-
Qwen-VL-Chat (Bai et al. 2023)	2023	I	ViT-bigG	Qwen-7B	1.4B+7M+350K	-	78.2	-
CogVLM-17B (Weihan Wang et al. 2024)	2023	I	EVA2-CLIP-E	Vicuna-7B	1.5B	-	82.3	-
LLaVA-NeXT-34B (Y. J. Lee et al. 2024)	2024	I	CLIP ViT-L-336	Nous-Hermes-Yi-34B	1.3M	-	83.7	-
GPT-4V (OpenAI 2023)	2023	I	undisclosed	GPT-4	undisclosed	-	77.2	-
Gemini-1.5 Pro (Team et al. 2024)	2024	I	undisclosed	Gemini Pro	undisclosed	-	73.2	-

GQA was dropped to keep the table within 14 cm; instruction-tuned VLM GQA scores cluster between 49 and 67 (InstructBLIP-13B 49.5, Qwen-VL-Chat 57.5, LLaVA-1.5-13B 63.3, CogVLM 65.2, LLaVA-NeXT-34B 67.1) and are visited per-model in the chapter sections below. The " - " entries are not failures: contrastive dual-encoders cannot answer free-form VQA questions without a generative head; instruction VLMs are not optimized for ImageNet zero-shot prompting; closed frontier systems (GPT-4V, Gemini) do not publish architectural details.[fn:s3-problem-vlm-comparability]

The contrastive era (rows 1-7) opens with CLIP, whose ViT-L/14-336 set the 76.2% ImageNet zero-shot bar that anchored the field for two years. ALIGN matched it the same year using 1.8B noisy web pairs, validating that curation could be replaced by scale. FLIP introduced patch-level masking to halve training compute at modest accuracy cost. OpenCLIP reproduced the recipe openly on LAION-2B, climbing to 80.1% with the ViT-G/14 model and giving the academic community its first matched-spec CLIP variant. EVA-CLIP layered MIM-pretrained vision backbones onto LAION to push to 82.0%, while SigLIP swapped softmax InfoNCE for sigmoid loss and reached 83.0% on WebLI-10B; the SigLIP recipe has since become the default vision tower for downstream VLMs (PaliGemma, LLaVA-NeXT-Sigmoid). Chinese-CLIP (row 7) extended the pattern to non-English, with bilingual retrieval scores no English-trained CLIP can match.

The generative and bootstrapping era (rows 8-14) introduced unified objectives for retrieval, captioning, and VQA simultaneously. BLIP's CapFilt cleaned web captions and reached 78.2 VQAv2 with only 129M training pairs. CoCa fused contrastive and captioning losses in one decoder, hitting 86.3 ImageNet zero-shot (the contrastive ceiling at the time) and 82.3 fine-tuned VQAv2. BLIP-2's Q-Former showed that frozen-encoder/frozen-LLM bridging trains for a fraction of full-finetune cost and still reaches 65.0 zero-shot VQAv2 with FlanT5-XXL. BEiT-3 unified vision and language as a single Multiway Transformer, taking 84.0 fine-tuned VQAv2 with only 35M training pairs by reusing the same backbone for image, text, and image-text inputs. Flamingo and its open clone IDEFICS-80B introduced gated cross-attention and Perceiver resamplers to enable few-shot multimodal in-context learning, while Florence-2 went the opposite direction (dense prediction unification with DaViT and a BART-style decoder).

The instruction era (rows 15-22) replaced task-specific fine-tuning with visual instruction tuning. MiniGPT-4 showed the basic recipe; LLaVA-1.5 cleaned it up with a 2-layer MLP projector and 336px CLIP, reaching 80.0 VQAv2 on a 1.2M-sample budget. InstructBLIP turned 26 datasets into instruction format atop the BLIP-2 Q-Former. Qwen-VL added bounding-box supervision and bilingual training. CogVLM introduced "visual expert" parallel layers (separate QKV/FFN per Transformer block) for deeper fusion, reaching 82.3 VQAv2 without forgetting language ability. LLaVA-NeXT scaled to dynamic resolution and 34B LLM backbones (83.7 VQAv2). The closed frontier (GPT-4V, Gemini-1.5 Pro) reports lower academic VQAv2 (77.2 / 73.2) than the best open-source models on the same benchmark, but holds qualitative leads on long-context, document, and reasoning evaluations not tabulated here. The closed-vs-open gap that mattered most in 2023 had largely closed by late 2024 on extractive benchmarks while remaining wide on holistic ones.

References

Alayrac, Jean-Baptiste, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, et al. 2022. "Flamingo: A Visual Language Model for Few-Shot Learning." arXiv. https://doi.org/10.48550/arXiv.2204.14198.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. "Neural Machine Translation by Jointly Learning to Align and Translate." In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1409.0473.

Bai, Jinze, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv. https://doi.org/10.48550/arXiv.2308.12966.

Balaji, Yogesh, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, et al. 2023. "eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers." arXiv. https://doi.org/10.48550/arXiv.2211.01324.

Bao, Fan, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. "All Are Worth Words: A ViT Backbone for Diffusion Models." arXiv. https://doi.org/10.48550/arXiv.2209.12152.

Betker, James, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, et al. 2023. "Improving Image Generation with Better Captions."

Blattmann, Andreas, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, et al. 2023. "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." arXiv. https://doi.org/10.48550/arXiv.2311.15127.

Blattmann, Andreas, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. 2023. "Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models." arXiv. https://doi.org/10.48550/arXiv.2304.08818.

Brooks, Tim, Aleksander Holynski, and Alexei A. Efros. 2023. "InstructPix2Pix: Learning to Follow Image Editing Instructions." arXiv. https://doi.org/10.48550/arXiv.2211.09800.

Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. 2016. "Importance Weighted Autoencoders." In International Conference on Learning Representations. arXiv. https://doi.org/10.48550/arXiv.1509.00519.

Chang, Huiwen, Han Zhang, Jarred Barber, A. J. Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, et al. 2023. "Muse: Text-To-Image Generation via Masked Generative Transformers." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.2301.00704.

Chang, Huiwen, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. 2022. "MaskGIT: Masked Generative Image Transformer." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.2202.04200.

Chen, Junsong, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, et al. 2023. "PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis." arXiv. https://doi.org/10.48550/arXiv.2310.00426.

Chen, Mark, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. "Generative Pretraining From Pixels." In Proceedings of the 37th International Conference on Machine Learning, 1691-1703. PMLR. https://proceedings.mlr.press/v119/chen20s.html.

Chu, Xiangxiang, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, et al. 2023. "MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices." arXiv. https://doi.org/10.48550/arXiv.2312.16886.

Dai, Wenliang, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. "InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning." arXiv. https://doi.org/10.48550/arXiv.2305.06500.

De Bortoli, Valentin, James Thornton, Jeremy Heng, and Arnaud Doucet. 2021. "Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling." In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2106.01357.

Dhariwal, Prafulla, and Alex Nichol. 2021. "Diffusion Models Beat GANs on Image Synthesis." arXiv. https://doi.org/10.48550/arXiv.2105.05233.

Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. 2017. "Density Estimation Using Real NVP." arXiv. https://doi.org/10.48550/arXiv.1605.08803.

Durkan, Conor, Artur Bekasov, Iain Murray, and George Papamakarios. 2019. "Neural Spline Flows." arXiv. https://doi.org/10.48550/arXiv.1906.04032.

Esser, Patrick, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, et al. 2024. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." arXiv. https://doi.org/10.48550/arXiv.2403.03206.

Gafni, Oran, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. 2022. "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors." In European Conference on Computer Vision (ECCV). arXiv. https://doi.org/10.48550/arXiv.2203.13131.

Ge, Songwei, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. 2022. "Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer." In European Conference on Computer Vision (ECCV). arXiv. https://doi.org/10.48550/arXiv.2204.03638.

Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. "Generative Adversarial Networks." In Advances in Neural Information Processing Systems. arXiv. https://doi.org/10.48550/arXiv.1406.2661.

Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. "Improved Training of Wasserstein GANs." In Advances in Neural Information Processing Systems. arXiv. https://doi.org/10.48550/arXiv.1704.00028.

Guo, Yuwei, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. 2024. "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models Without Specific Tuning." arXiv. https://doi.org/10.48550/arXiv.2307.04725.

Han, Jian, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. 2025. "Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis." arXiv. https://doi.org/10.48550/arXiv.2412.04431.

Henighan, Tom, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, et al. 2020. "Scaling Laws for Autoregressive Generative Modeling." arXiv. https://doi.org/10.48550/arXiv.2010.14701.

Hertz, Amir, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. "Prompt-to-Prompt Image Editing with Cross Attention Control." arXiv. https://doi.org/10.48550/arXiv.2208.01626.

Higgins, Irina, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. "Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework." In International Conference on Learning Representations. https://openreview.net/forum?id=Sy2fzU9gl.

Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. 2006. "Reducing the Dimensionality of Data with Neural Networks." Science 313 (5786): 504-7.

Ho, Jonathan, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, et al. 2022. "Imagen Video: High Definition Video Generation with Diffusion Models." arXiv. https://doi.org/10.48550/arXiv.2210.02303.

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. "Denoising Diffusion Probabilistic Models." In Advances in Neural Information Processing Systems. arXiv. https://doi.org/10.48550/arXiv.2006.11239.

Ho, Jonathan, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. 2022. "Video Diffusion Models." arXiv. https://doi.org/10.48550/arXiv.2204.03458.

Hoogeboom, Emiel, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, and Tim Salimans. 2022. "Autoregressive Diffusion Models." In International Conference on Learning Representations. arXiv. https://doi.org/10.48550/arXiv.2110.02037.

Hooker, Sara. 2020. "The Hardware Lottery." arXiv. https://doi.org/10.48550/arXiv.2009.06489.

Hyvärinen, Aapo. 2005. "Estimation of Non-Normalized Statistical Models by Score Matching." Journal of Machine Learning Research (JMLR) 6: 695-709.

Jia, Chao, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision." arXiv. https://doi.org/10.48550/arXiv.2102.05918.

Jordan, Michael I., Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. "An Introduction to Variational Methods for Graphical Models." Machine Learning 37 (2): 183-233.

Karras, Tero, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. "Progressive Growing of GANs for Improved Quality, Stability, and Variation." In International Conference on Learning Representations. arXiv. https://doi.org/10.48550/arXiv.1710.10196.

Karras, Tero, Miika Aittala, Timo Aila, and Samuli Laine. 2022. "Elucidating the Design Space of Diffusion-Based Generative Models." arXiv. https://doi.org/10.48550/arXiv.2206.00364.

Karras, Tero, Samuli Laine, and Timo Aila. 2019. "A Style-Based Generator Architecture for Generative Adversarial Networks." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.1812.04948.

Kingma, Diederik P., and Prafulla Dhariwal. 2018. "Glow: Generative Flow with Invertible 1x1 Convolutions." arXiv. https://doi.org/10.48550/arXiv.1807.03039.

Kingma, Diederik P., and Max Welling. 2014. "Auto-Encoding Variational Bayes." In International Conference on Learning Representations. arXiv. https://doi.org/10.48550/arXiv.1312.6114.

Kulkarni, Girish, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2011. "Baby Talk: Understanding and Generating Image Descriptions." In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/CVPR.2011.5995466.

Lee, Doyup, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. 2022. "Autoregressive Image Generation Using Residual Quantization." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.2203.01941.

Lee, Yong Jae, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Haotian Liu. 2024. "LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge." LLaVA. https://llava-vl.github.io/blog/2024-01-30-llava-next/.

Li, Guangyuan, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, and Joni Pajarinen. 2025. "Object-Centric Vision Token Pruning for Vision Language Models." arXiv. https://doi.org/10.48550/arXiv.2511.20439.

Li, Junnan, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. "BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models." arXiv. https://doi.org/10.48550/arXiv.2301.12597.

Li, Junnan, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. "BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation." arXiv. https://doi.org/10.48550/arXiv.2201.12086.

Li, Tianhong, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, and Dilip Krishnan. 2023. "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.2211.09117.

Lipman, Yaron, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. 2023. "Flow Matching for Generative Modeling." arXiv. https://doi.org/10.48550/arXiv.2210.02747.

Liu, Haotian, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. "Improved Baselines with Visual Instruction Tuning." arXiv. https://doi.org/10.48550/arXiv.2310.03744.

Liu, Haotian, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. "Visual Instruction Tuning." arXiv. https://doi.org/10.48550/arXiv.2304.08485.

Liu, Jinlai, Jian Han, Bin Yan, Hui Wu, Fengda Zhu, Xing Wang, Yi Jiang, Bingyue Peng, and Zehuan Yuan. 2025. "InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation." In Advances in Neural Information Processing Systems. arXiv. https://doi.org/10.48550/arXiv.2511.04675.

Lu, Cheng, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. 2022. "DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models." Machine Intelligence Research 22 (4): 730-51. https://doi.org/10.1007/s11633-025-1562-4.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. "Distributed Representations of Words and Phrases and Their Compositionality." In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper_files/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.

Nichol, Alex, and Prafulla Dhariwal. 2021. "Improved Denoising Diffusion Probabilistic Models." arXiv. https://doi.org/10.48550/arXiv.2102.09672.

Nichol, Alex, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2022. "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models." arXiv. https://doi.org/10.48550/arXiv.2112.10741.

Oord, Aäron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016a. "Pixel Recurrent Neural Networks." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.1601.06759.

Oord, Aäron van den, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. 2016. "Conditional Image Generation with PixelCNN Decoders." In Advances in Neural Information Processing Systems. arXiv. https://doi.org/10.48550/arXiv.1606.05328.

Oord, Aäron van den, Oriol Vinyals, and Koray Kavukcuoglu. 2017. "Neural Discrete Representation Learning." In Advances in Neural Information Processing Systems. arXiv. https://doi.org/10.48550/arXiv.1711.00937.

Oord, Aäron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. "WaveNet: A Generative Model for Raw Audio." https://arxiv.org/abs/1609.03499.

OpenAI. 2023. "GPT-4V(ision) System Card." OpenAI Technical Report. https://cdn.openai.com/papers/GPTV_System_Card.pdf.

Oquab, Maxime, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, et al. 2024. "DINOv2: Learning Robust Visual Features Without Supervision." arXiv. https://doi.org/10.48550/arXiv.2304.07193.

Papamakarios, George, Theo Pavlakou, and Iain Murray. 2018. "Masked Autoregressive Flow for Density Estimation." arXiv. https://doi.org/10.48550/arXiv.1705.07057.

Peebles, William, and Saining Xie. 2023. "Scalable Diffusion Models with Transformers." arXiv. https://doi.org/10.48550/arXiv.2212.09748.

Podell, Dustin, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis." arXiv. https://doi.org/10.48550/arXiv.2307.01952.

Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. "Learning Transferable Visual Models From Natural Language Supervision." arXiv. https://doi.org/10.48550/arXiv.2103.00020.

Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. "Language Models Are Unsupervised Multitask Learners." OpenAI Technical Report.

Ramesh, Aditya, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. "Hierarchical Text-Conditional Image Generation with CLIP Latents." arXiv. https://doi.org/10.48550/arXiv.2204.06125.

Ramesh, Aditya, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. "Zero-Shot Text-to-Image Generation." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.2102.12092.

Razavi, Ali, Aäron van den Oord, and Oriol Vinyals. 2019. "Generating Diverse High-Fidelity Images with VQ-VAE-2." In Advances in Neural Information Processing Systems. arXiv. https://doi.org/10.48550/arXiv.1906.00446.

Reynolds, Douglas A. 2009. "Gaussian Mixture Models." In Encyclopedia of Biometrics, edited by Stan Z. Li and Anil K. Jain, 659-63. Springer.

Rezende, Danilo Jimenez, and Shakir Mohamed. 2015. "Variational Inference with Normalizing Flows." In International Conference on Machine Learning (ICML). arXiv. https://doi.org/10.48550/arXiv.1505.05770.

Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. "High-Resolution Image Synthesis with Latent Diffusion Models." arXiv. https://doi.org/10.48550/arXiv.2112.10752.

Saharia, Chitwan, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, et al. 2022. "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding." arXiv. https://doi.org/10.48550/arXiv.2205.11487.

Salimans, Tim, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. 2017. "PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications." In International Conference on Learning Representations. arXiv. https://doi.org/10.48550/arXiv.1701.05517.

Song, Jiaming, Chenlin Meng, and Stefano Ermon. 2021. "Denoising Diffusion Implicit Models." In International Conference on Learning Representations. arXiv. https://doi.org/10.48550/arXiv.2010.02502.

Song, Yang, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. 2023. "Consistency Models." arXiv. https://doi.org/10.48550/arXiv.2303.01469.

Song, Yang, and Stefano Ermon. 2020. "Generative Modeling by Estimating Gradients of the Data Distribution." arXiv. https://doi.org/10.48550/arXiv.1907.05600.

Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. "Score-Based Generative Modeling Through Stochastic Differential Equations." In International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2011.13456.

"SORA: Video Generation Models as World Simulators." 2024. https://openai.com/index/video-generation-models-as-world-simulators/.

Sun, Peize, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. 2024. "Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation." arXiv. https://doi.org/10.48550/arXiv.2406.06525.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. "Sequence to Sequence Learning with Neural Networks." In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1409.3215.

Team, Gemini, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, et al. 2023. "Gemini: A Family of Highly Capable Multimodal Models." arXiv. https://doi.org/10.48550/arXiv.2312.11805.

Team, Gemini, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, et al. 2024. "Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context." arXiv. https://doi.org/10.48550/arXiv.2403.05530.

Tian, Keyu, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. 2024. "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction." In Advances in Neural Information Processing Systems. arXiv. https://doi.org/10.48550/arXiv.2404.02905.

Vahdat, Arash, and Jan Kautz. 2020. "NVAE: A Deep Hierarchical Variational Autoencoder." In Advances in Neural Information Processing Systems. arXiv. https://doi.org/10.48550/arXiv.2007.03898.

Villegas, Ruben, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. 2023. "Phenaki: Variable Length Video Generation From Open Domain Textual Description." In International Conference on Learning Representations. arXiv. https://doi.org/10.48550/arXiv.2210.02399.

Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. "Extracting and Composing Robust Features with Denoising Autoencoders." In International Conference on Machine Learning (ICML).

Voleti, Vikram, Alexia Jolicoeur-Martineau, and Christopher Pal. 2022. "MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation." In Advances in Neural Information Processing Systems. arXiv. https://doi.org/10.48550/arXiv.2205.09853.

Wang, Weihan, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, et al. 2024. "CogVLM: Visual Expert for Pretrained Language Models." arXiv. https://doi.org/10.48550/arXiv.2311.03079.

Wang, Weilun, Jianmin Bao, Wengang Zhou, Dongdong Chen, Dong Chen, Lu Yuan, and Houqiang Li. 2023. "Semantic Image Synthesis via Diffusion Models." arXiv. https://doi.org/10.48550/arXiv.2207.00050.

Xue, Le, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, et al. 2024. "xGen-MM (BLIP-3): A Family of Open Large Multimodal Models." arXiv. https://doi.org/10.48550/arXiv.2408.08872.

Yan, Wilson, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. 2021. "VideoGPT: Video Generation Using VQ-VAE and Transformers." arXiv. https://doi.org/10.48550/arXiv.2104.10157.

Yang, An, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. 2023. "Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese." arXiv. https://doi.org/10.48550/arXiv.2211.01335.

Yu, Jiahui, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, et al. 2022. "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation." arXiv. https://doi.org/10.48550/arXiv.2206.10789.

Yuan, Zhengqing, Zhaoxu Li, Weiran Huang, Yanfang Ye, and Lichao Sun. 2024. "TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones." arXiv. https://doi.org/10.48550/arXiv.2312.16862.

Zhai, Xiaohua, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. 2022. "Scaling Vision Transformers." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.2106.04560.

Zhai, Xiaohua, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. "Sigmoid Loss for Language Image Pre-Training." arXiv. https://doi.org/10.48550/arXiv.2303.15343.

Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. 2023. "Adding Conditional Control to Text-to-Image Diffusion Models." arXiv. https://doi.org/10.48550/arXiv.2302.05543.

Zhu, Yichen, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. 2024. "LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model." arXiv. https://doi.org/10.48550/arXiv.2401.02330.

Zhuang, Xianwei, Yuxin Xie, Yufan Deng, Liming Liang, Jinghan Ru, Yuguo Yin, and Yuexian Zou. 2025. "VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model." arXiv. https://doi.org/10.48550/arXiv.2501.12327.