Computer Vision - Adnan Harun Dogan

CNN-Based Object Detection

Object detection localizes and classifies multiple objects in images via bounding boxes. Across roughly six years (2014-2020) the literature settled into five overlapping families: two-stage refinement, single-shot dense prediction, focal-loss-balanced one-stage, anchor-free per-pixel and keypoint heads, and set-prediction transformers, each occupying a different point on the accuracy/speed/simplicity frontier. Each family also addressed a characteristic set of well-known problems (scale, imbalance, NMS heuristics, anchor design, slow convergence, etc.), which the paragraphs after the SoTA leaderboard below walk through one family at a time.

The historical context matters because the same problems recur under new names. Pre-deep-learning detection plateaued near 33% mAP on PASCAL VOC under HOG-feature pipelines¹ (Dalal and Triggs 2005) and the deformable part model (Felzenszwalb, McAllester, and Ramanan 2008), a sliding-window root + parts mixture trained with latent SVM that defined the pre-CNN state of the art for almost a decade. R-CNN (Girshick et al. 2014) lifted it to 53.3% by reusing ImageNet-pretrained (Deng et al. 2009) CNN features on selective-search proposals, but at ~49 seconds per image. Fast R-CNN (Girshick 2015) shared backbone computation across proposals for a 213 $\times$ test-time speedup; Faster R-CNN (Ren et al. 2015) replaced the external proposal stage with a learned Region Proposal Network. Single-shot detectors emerged in parallel: YOLO (Redmon et al. 2016) reframed detection as direct regression on a 7 $\times$ 7 grid at 45 FPS, and SSD (W. Liu et al. 2016) added multi-scale feature maps. RetinaNet (2017) closed the one-stage/two-stage accuracy gap with focal loss (Lin, Goyal, et al. 2017), while FPN (Lin, Dollár, et al. 2017) became the universal neck. Anchor-free methods (CornerNet (Law and Deng 2018), FCOS (Tian et al. 2019), CenterNet (Zhou, Wang, and Krähenbühl 2019)) eliminated the anchor-design hyperparameters in 2018-2019; DETR (Carion et al. 2020) removed NMS via Hungarian matching in 2020 (covered in the transformer chapter).

The five families differ on which problems they accept and which they engineer away. Two-stage detectors handle the foreground/background imbalance via balanced sampling at the RoI head, but inherit a complex multi-stage pipeline and slow inference. One-stage detectors face the imbalance directly through hard mining (SSD), focal loss (RetinaNet), or implicit balanced anchors (YOLO grid). Anchor-free heads sidestep the entire scale-and-aspect anchor-tuning enterprise but still need a label-assignment heuristic (smallest-area box, centerness, etc.). Set-prediction transformers bypass NMS at the cost of 500-epoch training. No family dominates: the COCO leaderboard top is a moving boundary between Cascade-style two-stage refinement, EfficientDet-style compound-scaled one-stage, and DINO-DETR-style transformers, and the production-grade default depends on whether latency, throughput, or peak accuracy is the primary constraint² ³.

CNN object-detection SoTA leaderboard. Method links to the chapter section, with citation attached. Year is the conference/arXiv year. Backbone is the canonical reported configuration; Head/Loss summarises the detection head and the canonical regression/classification loss with key hyperparameters in parentheses. Data names the training set (VOC07+12 trainval / COCO trainval (Lin et al. 2014) / OpenImages). VOC = PASCAL VOC 2007 test mAP at IoU=0.5. COCO columns: AP is the mean over IoU thresholds 0.5:0.05:0.95 on val2017 single-scale; AP50 is AP at IoU=0.5; AP_S, AP_M, AP_L are the standard small (<32² px) / medium (32²-96² px) / large (>96² px) splits. FPS is the original paper's reported speed on its own GPU (K40, Titan X, V100, A100), so cross-family comparisons are unreliable (see the footnote referenced in the prose). Dashes mark unreported settings. RPN+RoI(softmax) denotes Faster-style two-stage classification; Focal denotes RetinaNet-style focal cross-entropy; Hungarian denotes DETR-style set matching.
Method	Year	Backbone	Head/Loss	Data	VOC mAP	COCO AP	COCO AP50	COCO AP_S	COCO AP_M	COCO AP_L	Speed (FPS)
R-CNN (Girshick et al. 2014)	2014	AlexNet	SVM+ridge-reg	VOC07+12	53.3	-	-	-	-	-	0.05
Fast R-CNN (Girshick 2015)	2015	VGG-16	RoI(softmax)+sm-L1	VOC07+12	70.0	19.7	35.9	-	-	-	0.5
Faster R-CNN (Ren et al. 2015)	2015	VGG-16	RPN+RoI(softmax)+sm-L1	VOC07+12	73.2	21.9	42.7	-	-	-	5
YOLO v1 (Redmon et al. 2016)	2016	Darknet (custom)	grid-MSE( $\lambda_c$ =5, $\lambda_n$ =0.5)	VOC07+12	63.4	-	-	-	-	-	45
SSD-512 (W. Liu et al. 2016)	2016	VGG-16	hard-mining(3:1)+sm-L1	VOC07+12	76.8	28.8	48.5	10.9	31.8	43.5	22
FPN (Faster R-CNN) (Lin, Dollár, et al. 2017)	2017	ResNet-101	RPN+RoI(softmax)+sm-L1	COCO	-	36.2	59.1	18.2	39.0	48.2	5
Mask R-CNN (He et al. 2017)	2017	ResNeXt-101 + FPN	RPN+RoI(softmax)+sm-L1+mask	COCO	-	39.8	62.3	22.1	43.2	51.2	5
RetinaNet (Lin, Goyal, et al. 2017)	2017	ResNet-101 + FPN	Focal( $\gamma$ =2, $\alpha$ =0.25)+sm-L1	COCO	-	39.1	59.1	21.8	42.7	50.2	11
YOLO v3 (Redmon and Farhadi 2018)	2018	Darknet-53	BCE(per-anchor obj)+MSE(box)	COCO	-	33.0	57.9	18.3	35.4	41.9	20
Cascade R-CNN (Cai and Vasconcelos 2018)	2018	ResNet-101 + FPN	3-head cascade(0.5/0.6/0.7)	COCO	-	42.8	62.1	23.7	45.5	55.2	7
PANet (Shu Liu et al. 2018)	2018	ResNeXt-101	PAFPN+RoI(softmax)+sm-L1	COCO	-	47.4	67.2	27.2	51.0	60.0	4
CornerNet (Law and Deng 2018)	2018	Hourglass-104	Focal-heatmap+pull/push	COCO	-	42.1	57.8	20.8	44.8	56.7	4
FCOS (Tian et al. 2019)	2019	ResNet-101 + FPN	center-ness+GIoU+focal-cls	COCO	-	41.5	60.7	24.4	44.8	51.6	17
CenterNet (Zhou, Wang, and Krähenbühl 2019)	2019	Hourglass-104	Gaussian-heatmap+L1(size)	COCO	-	42.1	61.1	19.9	43.0	51.4	7
EfficientDet-D7 (Tan, Pang, and Le 2020)	2020	EfficientNet-B6	BiFPN+Focal+sm-L1	COCO	-	53.7	72.4	35.8	57.0	66.3	6
YOLO v4 (Bochkovskiy, Wang, and Liao 2020)	2020	CSPDarknet-53	CIoU+BCE-obj+Mosaic-aug	COCO	-	43.5	65.7	26.7	46.7	53.3	65
DETR (Carion et al. 2020)	2020	ResNet-101 (DC5)	Hungarian+CE+L1+GIoU(5,2)	COCO	-	44.9	64.7	23.7	49.5	62.3	10
Deformable DETR (Zhu et al. 2021)	2021	ResNet-50	Hungarian+CE+L1+GIoU+def-att	COCO	-	46.2	65.2	28.8	49.2	61.7	19
DINO-DETR (H. Zhang et al. 2023)	2022	Swin-L	Hungarian+denoising+contrast	COCO+O365	-	58.5	77.1	41.5	62.7	74.0	10
Co-DETR (Zong, Song, and Liu 2023)	2023	ViT-L (MAE)	hybrid Hungarian+ATSS+Faster	COCO+O365	-	66.0	-	-	-	-	-

The leaderboard separates six eras and carries a speed column whose meaning differs from the accuracy columns. Because the rows are ordered by year, reading down the Method and Head/Loss columns gives the family-by-family progression (R-CNN $\to$ one-stage $\to$ Focal $\to$ anchor-free $\to$ compound-scaled $\to$ transformer), while reading down the COCO AP column reveals the trajectory from 19.7 (Fast R-CNN, 2015) to 66.0 (Co-DETR, 2023), a 3.4 $\times$ gain spread across backbone scaling⁴, pyramid sophistication⁵, and head redesign. Three diagnostics follow. First, in every row that reports the size splits, AP_S lags AP_M by roughly 17-26 points and AP_L by roughly 24-39 points, the single most persistent failure mode of CNN detectors⁶. Second, FPS is comparable within a family, but cross-family comparisons span four GPU generations⁷. Third, the gap between the best two-stage and the best one-stage detector inverted between 2018 and 2020, eliminating the historical "two-stage = accuracy, one-stage = speed" dichotomy⁸.

The two-stage R-CNN family defines the era. R-CNN ran 2000 selective-search proposals through an AlexNet/VGG forward pass each, so per-image inference at ~49s was untenable; Fast R-CNN shared a single backbone pass and replaced the per-class SVMs with a softmax + smooth-L1 multi-task head, while Faster R-CNN folded proposals into a learned RPN sliding 9 hand-tuned anchors per location⁹. Mask R-CNN extended the same scaffolding with a per-class binary mask branch and replaced quantising RoIPool with bilinear RoIAlign; the pixel-precise alignment matters far more at strict IoU thresholds¹⁰. Cascade R-CNN refines proposals through three heads at IoU thresholds 0.5/0.6/0.7, exploiting the quality-mismatch observation that a single head cannot simultaneously be accurate at all IoU regimes; this is also where label assignment becomes structurally explicit¹¹.
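The effect of the 0.5/0.6/0.7 thresholds on label assignment can be illustrated with a minimal sketch. It only shows how one fixed proposal is labelled at each threshold; the actual cascade also re-regresses the box between stages, which is omitted here. The names `iou`, `positive_stages`, and the example boxes are illustrative assumptions, not taken from the paper's code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Per-stage IoU thresholds quoted in the text.
CASCADE_THRESHOLDS = (0.5, 0.6, 0.7)


def positive_stages(proposal, gt_box, thresholds=CASCADE_THRESHOLDS):
    """For each stage, report whether the proposal counts as a positive."""
    overlap = iou(proposal, gt_box)
    return [overlap >= t for t in thresholds]


gt = (0.0, 0.0, 100.0, 100.0)
loose = (0.0, 0.0, 100.0, 55.0)  # covers 55% of the ground truth
tight = (0.0, 0.0, 100.0, 80.0)  # covers 80% of the ground truth

print(positive_stages(loose, gt))  # positive only for the 0.5 head
print(positive_stages(tight, gt))  # positive for all three heads
```

A loosely localized proposal trains only the first head, which is the sense in which each head specializes in one IoU regime.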

The single-shot SSD/YOLO line trades the proposal stage for direct dense prediction. YOLO partitioned the image into a 7 $\times$ 7 grid where each cell predicts $B=2$ boxes plus class probabilities, running at 45 FPS but reaching only 63.4 VOC mAP and paying for speed in localization (19% of YOLO's error is localization vs Fast R-CNN's 8%). SSD recovered accuracy by attaching prediction heads to multiple feature-map resolutions (38 $\times$ 38 through 1 $\times$ 1) so that small objects map to high-resolution layers and large objects to coarse layers, a structurally different solution to scale variance¹². SSD also introduced 3:1 hard negative mining as its blunt-instrument response to foreground/background imbalance¹³, and depends heavily on data augmentation (74.3 $\to$ 65.5 mAP without augmentation). YOLOv3 kept the k-means anchor priors introduced in YOLOv2 and added FPN-style three-scale prediction; YOLOv4 split engineering into "Bag of Freebies" (Mosaic, CIoU, label smoothing) and "Bag of Specials" (Mish, DIoU-NMS, PANet neck), reaching 43.5 COCO AP at 65 FPS by combining many small recipes rather than one architectural insight.
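The 3:1 hard negative mining mentioned above can be sketched in a few lines. This is a simplified illustration assuming per-anchor classification losses are already computed; the function name `hard_negative_mining` and the toy loss values are hypothetical.

```python
def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Return the anchor indices that contribute to the classification loss.

    All positives are kept; negatives are ranked by loss and only the
    hardest neg_pos_ratio * (number of positives) are kept.
    """
    pos = [i for i, p in enumerate(is_positive) if p]
    neg = [i for i, p in enumerate(is_positive) if not p]
    # Highest-loss negatives first: these are the "hard" ones.
    neg.sort(key=lambda i: losses[i], reverse=True)
    keep_neg = neg[: neg_pos_ratio * len(pos)]
    return sorted(pos + keep_neg)


# Toy example: one positive anchor (index 1) among eight anchors.
losses = [0.1, 2.0, 0.05, 1.5, 0.9, 0.3, 0.02, 0.7]
is_positive = [False, True, False, False, False, False, False, False]

kept = hard_negative_mining(losses, is_positive)
print(kept)  # the positive plus the three highest-loss negatives
```

The hard cutoff is visible in the slice: every negative outside the top 3 per positive contributes nothing, which is the behaviour focal loss later replaces with smooth down-weighting.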

RetinaNet + FPN closed the one-stage/two-stage accuracy gap with two orthogonal contributions. Focal loss $-\alpha_t(1-p_t)^\gamma \log p_t$ with $\gamma=2,\alpha=0.25$ reweights cross-entropy by prediction confidence so that the cumulative loss from ~100k easy negatives no longer drowns ~100 hard positives¹⁴, and the classification head's bias is initialized to $-\log((1-\pi)/\pi),\ \pi=0.01$ so that every anchor starts with a predicted foreground probability of about $\pi$, matching the rarity of positives. Focal loss beat OHEM because it uses every negative with smooth down-weighting rather than a hard cutoff. Crucially, RetinaNet sat on top of FPN (Lin, Dollár, et al. 2017), whose top-down + lateral pathway propagates semantic strength to high-resolution feature levels at single-pass cost; AP_S rose from 14.1 to 19.9 (+5.8) on COCO purely from FPN, and FPN became the universal neck across Faster R-CNN, Mask R-CNN, RetinaNet, FCOS, and EfficientDet¹⁵.
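A minimal per-prediction sketch of the loss and the bias initialization described above, assuming the $\alpha$-balanced binary form in which positives are weighted by $\alpha$ and negatives by $1-\alpha$. The function names are illustrative; a real detector would vectorize this over all anchors and classes.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p is the predicted foreground probability, y is the label in {0, 1}.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balance weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def prior_bias(pi=0.01):
    """Bias that makes sigmoid(bias) equal the foreground prior pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# An easy negative (p = 0.01) is down-weighted by (1 - p_t)^2 = 1e-4,
# while a hard positive (p = 0.1) keeps most of its cross-entropy.
easy_negative = focal_loss(0.01, 0)
hard_positive = focal_loss(0.10, 1)
print(easy_negative, hard_positive)
print(sigmoid(prior_bias()))  # initial foreground probability, close to 0.01
```

With $\gamma=0$ and $\alpha_t=1$ the expression reduces to plain cross-entropy, which is a convenient sanity check when implementing it.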

CNN detection family evolution from 2015 R-CNN-era detectors to 2023 transformer-based DETR variants, tracking COCO AP vs publication year. Marker shape and colour encode family (two-stage / one-stage / anchor-free / transformer). Three structural observations match the table reading: (i) the two-stage and one-stage clusters stayed within 5 AP of each other from 2017 onward; (ii) anchor-free detectors landed on the same Pareto-front in 2019 without delivering a step change, suggesting that anchor design was a hyperparameter inconvenience rather than a fundamental capacity limit; (iii) transformer-based set-prediction (DETR $\to$ Deformable $\to$ DINO $\to$ Co-DETR) opens a clear gap from 2022 onward, lifting COCO AP from 47 (best CNN) to 66 (Co-DETR), a 19-AP improvement driven by larger backbones (Swin-L, ViT-L MAE) and richer training (denoising, contrastive auxiliaries).

The anchor-free turn (CornerNet, FCOS, CenterNet) made a structural rather than empirical case. FCOS predicts $(l,t,r,b)$ distances at every foreground pixel of the FPN levels and resolves multi-GT overlap by assigning to the smallest-area box, the only heuristic the paper retains¹⁶. The centerness branch $\sqrt{\min(l,r)/\max(l,r) \cdot \min(t,b)/\max(t,b)}$ suppresses low-quality peripheral predictions for +3.6 AP and is a cheap analogue of objectness. CornerNet detects top-left and bottom-right corner heatmaps with associative-embedding grouping, and CenterNet reduces this further to a single Gaussian-heatmap center per object plus a $(w,h)$ regression. Anchor-free heads sidestepped hand-tuned anchor scales/ratios entirely¹⁷ and proved that proper loss weighting plus per-level FPN assignment closes the gap; ATSS later showed that adaptive label assignment (mean+std of per-object IoU as the dynamic threshold) brings anchor-based and anchor-free recipes to within 0.1 AP, confirming that the anchor itself was never the load-bearing piece¹⁸.
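The FCOS target construction described above (per-pixel distances, smallest-area assignment, centerness) can be sketched as follows. The per-level regression ranges that FCOS uses to route objects to FPN levels are omitted, and the function names and example boxes are illustrative assumptions.

```python
import math


def assign_box(px, py, boxes):
    """Among boxes strictly containing the pixel, pick the smallest-area one.

    Returns None when the pixel lies in no box (background).
    """
    inside = [b for b in boxes if b[0] < px < b[2] and b[1] < py < b[3]]
    if not inside:
        return None
    return min(inside, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def ltrb_targets(px, py, box):
    """Distances from the pixel to the left, top, right, bottom box edges."""
    x1, y1, x2, y2 = box
    return (px - x1, py - y1, x2 - px, y2 - py)


def centerness(l, t, r, b):
    """1.0 at the box center, decaying towards 0 near the border."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


boxes = [(0, 0, 100, 100), (40, 40, 60, 60)]

center_box = assign_box(50, 50, boxes)  # inside both boxes
edge_box = assign_box(10, 50, boxes)    # inside the large box only

print(center_box, centerness(*ltrb_targets(50, 50, center_box)))
print(edge_box, centerness(*ltrb_targets(10, 50, edge_box)))
```

The second pixel sits near the left edge of its box, so its centerness is low and its prediction would be down-ranked at inference.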

The transformer set-prediction line, covered fully in the next chapter, starts with DETR replacing dense prediction + NMS with 100 learned object queries and Hungarian bipartite matching against the ground-truth set. NMS disappears because each query commits to one GT, eliminating the threshold-sensitive greedy suppression heuristic that limits CNN detectors in dense crowds¹⁹ ²⁰. The cost was a 500-epoch convergence schedule (versus 12-24 for Faster R-CNN); Deformable DETR's sparse multi-scale attention reduced this to 50 epochs, DAB-DETR/DN-DETR added content-aware queries and denoising auxiliaries, and DINO-DETR with Swin-L plus contrastive denoising reaches 58.5 COCO AP²¹. Co-DETR closed the loop by re-introducing one-to-many auxiliary heads (ATSS, Faster R-CNN) alongside the one-to-one Hungarian head during training, lifting Swin-L COCO val from 58.5 to 59.5 and ViT-L COCO test-dev to 66.0 AP, the current chapter ceiling.
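To make the one-to-one assignment concrete, here is a minimal sketch of minimum-cost bipartite matching between queries and ground-truth boxes. Two simplifications are assumed: the cost is a plain L1 box distance (DETR's matching cost also includes class probability and GIoU terms), and the optimum is found by brute-force enumeration rather than the Hungarian algorithm, which yields the same assignment but is only practical here for tiny inputs. All names and boxes are illustrative.

```python
from itertools import permutations


def l1_cost(pred, gt):
    """L1 distance between two boxes given as coordinate tuples."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def match_queries(pred_boxes, gt_boxes):
    """Minimum-cost one-to-one assignment; returns {gt_index: query_index}.

    Assumes at least as many queries as ground-truth boxes.
    """
    best_cost, best = float("inf"), None
    # Try every way of giving each ground-truth box a distinct query.
    for queries in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[q], gt_boxes[g])
                   for g, q in enumerate(queries))
        if cost < best_cost:
            best_cost, best = cost, queries
    return {g: q for g, q in enumerate(best)}


preds = [
    (0.10, 0.10, 0.20, 0.20),
    (0.50, 0.50, 0.90, 0.90),
    (0.52, 0.50, 0.90, 0.90),  # near-duplicate of query 1
    (0.00, 0.00, 0.05, 0.05),
]
gts = [(0.50, 0.50, 0.90, 0.90), (0.10, 0.10, 0.25, 0.20)]

matches = match_queries(preds, gts)
unmatched = sorted(set(range(len(preds))) - set(matches.values()))
print(matches, unmatched)
```

The near-duplicate query ends up unmatched and is therefore trained towards the "no object" class, which is why duplicates are suppressed by the loss rather than by NMS.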

The closing reading is that detection has not so much "solved" any of the listed problems as redistributed them. Scale variance moved from image pyramids into FPN; foreground/background imbalance moved from RPN sampling into focal loss into per-query bipartite assignment; anchor design was dropped (FCOS), shown to matter less than label assignment (ATSS), and is absent altogether in DETR; NMS was softened (Soft-NMS, DIoU-NMS) and then removed (DETR). What remains stubborn is small-object AP (still roughly 24-39 points below large-object AP in every table row that reports both, even at Swin-L scale), cross-paper FPS comparability (every paper reports its own GPU)²², and ImageNet/Objects365 pretraining dependence²³; the latter starts to dissolve only with self-supervised backbones (MAE, DINO) trained at the scale of $10^8$+ images²⁴.

Detection and Segmentation Transformers

Object detection has traditionally relied on two-stage detectors like Faster R-CNN or single-stage detectors like YOLO and RetinaNet. These methods require hand-designed components: anchor boxes for proposal generation, non-maximum suppression (NMS) for duplicate removal, and complex label assignment strategies. The introduction of DETR (Carion et al. 2020) fundamentally changed this paradigm by framing detection as a set prediction problem.

DETR uses a transformer encoder-decoder architecture with learned object queries and bipartite matching for training. While conceptually elegant, the original DETR suffered from slow convergence and difficulty with small objects. Subsequent work addressed these limitations through deformable attention (Zhu et al. 2021), improved query formulations (Shilong Liu et al. 2022), and denoising training strategies (F. Li et al. 2022).

Image segmentation has similarly been transformed by the mask classification paradigm. MaskFormer (Cheng, Schwing, and Kirillov 2021) showed that semantic, instance, and panoptic segmentation can be unified under a single framework. Mask2Former (Cheng et al. 2022) improved this with masked attention, reporting state-of-the-art results on all three tasks at the time of publication.

The Swin Transformer (Ze Liu et al. 2021) introduced hierarchical vision transformers with shifted window attention, providing efficient backbones for dense prediction tasks. Swin Transformer V2 (Ze Liu et al. 2022) scaled these models further with improved training stability.

Vision Transformers

Vision Transformers (ViT) represent a paradigm shift from CNNs to pure attention-based architectures. The foundational ViT paper (Dosovitskiy et al. 2021) demonstrated that a standard Transformer encoder, applied directly to sequences of image patches, achieves excellent results on image classification when pretrained on sufficient data (ImageNet-21K (Deng et al. 2009) or JFT-300M). ViT-Huge achieved 88.55% top-1 accuracy on ImageNet, matching state-of-the-art CNNs while using substantially fewer computational resources during training. The key insight: inductive biases of convolutions are not strictly necessary; transformers learn these from data.
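The "sequence of image patches" input can be illustrated with a small sketch that splits a 2D image into flattened, non-overlapping patches in row-major order. In ViT each flattened patch is then linearly projected and given a position embedding, which is omitted here; the function names and the toy image are illustrative.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for an image."""
    return (height // patch) * (width // patch)


def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tokens."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[y][x]
                           for y in range(top, top + patch)
                           for x in range(left, left + patch)])
    return tokens


# Toy 4 x 4 single-channel image with pixel values 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)

print(tokens)                    # four tokens of four pixels each
print(num_tokens(224, 224, 16))  # sequence length for a 224 x 224 input at patch size 16
```

Because the sequence length grows with the square of the image side over the patch size, smaller patches or larger inputs raise the cost of global attention quickly, which motivates the hierarchical designs discussed later.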

Data efficiency became the focus of subsequent work. DeiT (Touvron et al. 2021) showed ViT can be trained on ImageNet-1K alone using strong regularization, achieving 81.8% top-1 accuracy without external data (83.4% with distillation from a RegNetY-16GF teacher). DeiT introduced knowledge distillation via a distillation token, demonstrating that ViT benefits from CNN-style inductive biases transferred through distillation. DeiT III (Touvron, Cord, and Jégou 2022) pushed this further with improved training recipes, achieving 85.2% with ViT-H/14.

Hierarchical designs addressed the computational limitations of global attention. Swin Transformer (Ze Liu et al. 2021) introduced shifted window attention, computing attention within local windows while enabling cross-window connections through shifting. Because the window size is fixed, the attention cost grows linearly with the number of tokens rather than quadratically, and the hierarchy provides multi-scale feature maps compatible with dense prediction. Swin achieved 87.3% on ImageNet and set new state-of-the-art on COCO detection (58.7 box AP) and ADE20K segmentation (53.5 mIoU). Swin Transformer V2 (Ze Liu et al. 2022) scaled to 3B parameters with techniques for training stability (residual post-normalization, scaled cosine attention).
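The complexity difference can be checked numerically. The sketch below uses the per-layer cost expressions given in the Swin paper for a feature map of $h \times w$ tokens with $C$ channels and window size $M$: $4hwC^2 + 2(hw)^2C$ for global attention and $4hwC^2 + 2M^2hwC$ for windowed attention. The function names and the chosen sizes are illustrative.

```python
def global_msa_cost(h, w, c):
    """Global self-attention cost: quadratic in the number of tokens h * w."""
    return 4 * h * w * c * c + 2 * (h * w) ** 2 * c


def window_msa_cost(h, w, c, m=7):
    """Windowed self-attention cost: linear in h * w for a fixed window size m."""
    return 4 * h * w * c * c + 2 * m * m * h * w * c


small = (56, 56, 96)    # e.g. a 224 x 224 input after 4 x 4 patch merging
large = (112, 112, 96)  # the same layer at twice the input resolution

# Doubling the resolution quadruples the token count.
print(window_msa_cost(*large) / window_msa_cost(*small))  # exactly 4
print(global_msa_cost(*large) / global_msa_cost(*small))  # well above 4
```

The windowed cost scales by exactly the token-count ratio, while the global cost is dominated by the $(hw)^2$ term at these sizes.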

Alternative hierarchical approaches emerged. Pyramid Vision Transformer (PVT) (Wang et al. 2021) uses spatial reduction attention to handle high-resolution feature maps. Twins (Chu et al. 2021) combines locally-grouped attention with global sub-sampled attention. Focal Transformer (Yang et al. 2021) introduces focal attention attending both locally and globally with varying granularity. Nested Hierarchical Transformer (NesT) (Z. Zhang et al. 2021) aggregates tokens in a nested manner to create hierarchical representations.

Hybrid architectures combine CNN and Transformer strengths. Convolutional vision Transformer (CvT) (Wu et al. 2021) introduces convolutions into attention projection. CoAtNet (Dai et al. 2021) systematically combines depthwise convolution and attention, achieving 90.88% ImageNet accuracy with JFT-3B pretraining. MaxViT (Tu et al. 2022) interleaves block and grid attention patterns with convolutions for efficient multi-scale attention. Mobile-Former (Y. Chen et al. 2022) uses parallel mobile and former branches with bidirectional connections. MobileViT (Mehta and Rastegari 2022) unfolds features for transformer processing while maintaining mobile efficiency.

Self-supervised learning proved particularly effective for ViT. DINO (Caron et al. 2021) showed self-supervised ViT features contain semantic segmentation information not present in supervised models, achieving 78.3% k-NN accuracy on ImageNet. A follow-up diagnostic, Vision Transformers Need Registers (Darcet et al. 2024), identified a structural artifact in large pre-trained ViTs: a small fraction of patch tokens develop anomalously high feature norms and attract disproportionate attention weight, producing irregular blobs in attention maps. These high-norm tokens appear preferentially at low-information background patches, not at semantically salient locations. The interpretation is that the model repurposes uninformative patch tokens as global scratch space to store and transmit information that does not fit naturally into the attention pattern of any single semantic patch.

The fix is minimal: append $r = 4$ additional learnable [REG] tokens to the input sequence (alongside the [CLS] token) and include them throughout all layers. These tokens have no spatial correspondence and no loss target; they serve purely as internal workspace. With registers present, the high-norm artifact disappears from patch tokens, attention maps become spatially smooth, and dense prediction improves. The intervention requires no change to the training recipe, the loss function, or the backbone architecture. The paper reports new state of the art for self-supervised models on dense visual prediction tasks and demonstrates improvements in object discovery quality with larger models. The caveat is that register count $r = 4$ is a hyperparameter tuned on DINOv2-scale models; smaller ViTs trained on less data show less severe artifact behaviour and smaller gains from registers.

MAE (He et al. 2021) demonstrated that masking 75% of patches and reconstructing them provides strong pretraining, with ViT-Huge achieving 87.8% accuracy using only ImageNet-1K. BEiT (Bao et al. 2022) predicted discrete visual tokens from masked patches. SimMIM (Xie et al. 2022) showed simple random masking with direct pixel prediction works well.
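The 75% random masking used by MAE can be sketched as a shuffle-and-split over patch indices. This is a minimal illustration assuming uniform random masking over a flat list of patches; the function name `random_masking` and the `seed` parameter are hypothetical, and the encoder/decoder that consume the two index sets are omitted.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into a visible set and a masked set.

    Only the visible patches would be fed to the encoder; the masked
    ones are the reconstruction targets.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


# 196 patches corresponds to a 14 x 14 patch grid.
visible, masked = random_masking(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

Since the encoder sees only a quarter of the tokens, most of the pretraining compute saving comes directly from this split.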

Efficiency improvements targeted inference cost. DynamicViT (Rao et al. 2021) dynamically prunes uninformative tokens based on learned importance scores, reducing FLOPs by 31-37% with <0.5% accuracy drop. EfficientFormer (Y. Li et al. 2022) designed latency-driven architectures achieving mobile speed. Token-to-Token ViT (T2T-ViT) (Yuan et al. 2021) progressively tokenizes images to capture local structure.
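The inference-time effect of score-based token pruning can be sketched as a top-k selection. This is a simplified illustration, not DynamicViT's training procedure: the importance scores are assumed to be given (in the paper they come from a learned prediction module), and `prune_tokens` and `keep_ratio` are hypothetical names.

```python
def prune_tokens(tokens, scores, keep_ratio=0.7):
    """Keep the highest-scoring tokens, preserving their original order.

    Returns the kept tokens and their indices in the input sequence.
    """
    num_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:num_keep])
    return [tokens[i] for i in keep], keep


tokens = ["a", "b", "c", "d", "e"]
scores = [0.9, 0.1, 0.5, 0.8, 0.2]

kept_tokens, kept_idx = prune_tokens(tokens, scores, keep_ratio=0.6)
print(kept_tokens, kept_idx)
```

Applying such a step at several depths compounds the reduction, since every later layer attends over a shorter sequence.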

Understanding ViT behavior revealed important properties. An empirical study of self-supervised ViT training (Xinlei Chen, Xie, and He 2021) showed that ViT optimizes differently from CNNs, with sensitivity to optimizer, weight decay, and learning rate schedules. "How to train your ViT" (Steiner et al. 2022) provided comprehensive training recipes. Sharpness-aware minimization (SAM) (Foret et al. 2021) optimizes for flatter minima, and applying it to ViTs (Xiangning Chen, Hsieh, and Gong 2022) improved generalization to the point of outperforming ResNets without large-scale pretraining or strong data augmentation. Adversarial robustness analyses (Naseer et al. 2021; Shao et al. 2022) found ViT more robust than CNNs, an effect attributed in part to attention's global context.

Cross-attention between scales improved multi-scale reasoning. CrossViT (C.-F. Chen, Fan, and Panda 2021) uses dual-branch architecture with different patch sizes, fusing via cross-attention. This achieves efficient multi-scale processing while maintaining reasonable computational cost.

The field converged on: (1) hierarchical designs for dense prediction (Swin), (2) strong regularization for data-efficient training (DeiT), (3) self-supervised pretraining for label efficiency (MAE, DINO), (4) hybrid CNN-Transformer architectures for mobile deployment, and (5) the critical importance of training recipes matching architectural choices.

Related-work leaderboard (representative numbers, ImageNet-1K top-1 unless noted, COCO box AP with Mask R-CNN or HTC++ as cited in source papers, ADE20K mIoU with UperNet):

Backbone family	Params	ImageNet	COCO box AP	ADE20K mIoU	Pretrain data
ViT-B/16 (Dosovitskiy et al. 2021)	86M	84.0%	(n/a)	(n/a)	JFT-300M
ViT-H/14 (Dosovitskiy et al. 2021)	632M	88.55%	(n/a)	(n/a)	JFT-300M
DeiT-B (Touvron et al. 2021)	86M	83.4%	(n/a)	(n/a)	ImageNet-1K + distill
Swin-L (Ze Liu et al. 2021)	197M	87.3%	58.7	53.5	ImageNet-22K
Swin V2-G (Ze Liu et al. 2022)	3.0B	90.17%	63.1	59.9	SimMIM + 70M
PVT-L (Wang et al. 2021)	61M	81.7%	42.9	44.8	ImageNet-1K
Twins-SVT-L (Chu et al. 2021)	99M	83.7%	45.7	49.8	ImageNet-1K
Focal-L (Yang et al. 2021)	197M	87.2%	58.4	55.4	ImageNet-22K
CoAtNet-7 (Dai et al. 2021)	2.44B	90.88%	(n/a)	(n/a)	JFT-3B
MaxViT-XL (Tu et al. 2022)	475M	88.7%	(n/a)	(n/a)	ImageNet-21K
CvT-21 (Wu et al. 2021)	32M	82.5%	(n/a)	(n/a)	ImageNet-1K
ConvNeXt-B (Zhuang Liu et al. 2022)	89M	83.8%	52.7	49.9	ImageNet-1K
MAE ViT-H (He et al. 2021)	632M	87.8%	53.3 (ViTDet)	53.6	ImageNet-1K SSL
DINO ViT-B/8 (Caron et al. 2021)	86M	80.1% (lin)	(n/a)	(n/a)	ImageNet-1K SSL
BEiT-L (Bao et al. 2022)	305M	87.4%	53.3	57.0	ImageNet-22K + dVAE
SimMIM SwinV2-H (Xie et al. 2022)	658M	87.1%	(n/a)	(n/a)	ImageNet-1K SSL
EfficientFormer-L7 (Y. Li et al. 2022)	82M	83.3%	(n/a)	(n/a)	ImageNet-1K (7.0 ms on iPhone 12)
MobileViT-S (Mehta and Rastegari 2022)	5.6M	78.4%	(n/a)	(n/a)	ImageNet-1K

References

Bao, Hangbo, Li Dong, Songhao Piao, and Furu Wei. 2022. "BEiT: BERT Pre-Training of Image Transformers." arXiv. https://doi.org/10.48550/arXiv.2106.08254.

Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. 2020. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv. https://doi.org/10.48550/arXiv.2004.10934.

Cai, Zhaowei, and Nuno Vasconcelos. 2018. "Cascade R-CNN: Delving into High Quality Object Detection." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.1712.00726.

Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. "End-to-End Object Detection with Transformers." In European Conference on Computer Vision (ECCV). arXiv. https://doi.org/10.48550/arXiv.2005.12872.

Caron, Mathilde, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. "Emerging Properties in Self-Supervised Vision Transformers." arXiv. https://doi.org/10.48550/arXiv.2104.14294.

Chen, Chun-Fu, Quanfu Fan, and Rameswar Panda. 2021. "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification." arXiv. https://doi.org/10.48550/arXiv.2103.14899.

Chen, Xiangning, Cho-Jui Hsieh, and Boqing Gong. 2022. "When Vision Transformers Outperform ResNets Without Pre-Training or Strong Data Augmentations." arXiv. https://doi.org/10.48550/arXiv.2106.01548.

Chen, Xinlei, Saining Xie, and Kaiming He. 2021. "An Empirical Study of Training Self-Supervised Vision Transformers." arXiv. https://doi.org/10.48550/arXiv.2104.02057.

Chen, Yinpeng, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. 2022. "Mobile-Former: Bridging MobileNet and Transformer." arXiv. https://doi.org/10.48550/arXiv.2108.05895.

Cheng, Bowen, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. 2022. "Masked-Attention Mask Transformer for Universal Image Segmentation." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.2112.01527.

Cheng, Bowen, Alexander G. Schwing, and Alexander Kirillov. 2021. "Per-Pixel Classification Is Not All You Need for Semantic Segmentation." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.2107.06278.

Chu, Xiangxiang, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. "Twins: Revisiting the Design of Spatial Attention in Vision Transformers." arXiv. https://doi.org/10.48550/arXiv.2104.13840.

Dai, Zihang, Hanxiao Liu, Quoc V. Le, and Mingxing Tan. 2021. "CoAtNet: Marrying Convolution and Attention for All Data Sizes." arXiv. https://doi.org/10.48550/arXiv.2106.04803.

Dalal, Navneet, and Bill Triggs. 2005. "Histograms of Oriented Gradients for Human Detection." In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/CVPR.2005.177.

Darcet, Timothée, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. 2024. "Vision Transformers Need Registers." In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2309.16588.

Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. "ImageNet: A Large-Scale Hierarchical Image Database." In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/CVPR.2009.5206848.

Donahue, Jeff, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition." In International Conference on Machine Learning (ICML). https://arxiv.org/abs/1310.1531.

Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2021. "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv. https://doi.org/10.48550/arXiv.2010.11929.

Felzenszwalb, Pedro F., David McAllester, and Deva Ramanan. 2008. "A Discriminatively Trained, Multiscale, Deformable Part Model." In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/CVPR.2008.4587597.

Foret, Pierre, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. 2021. "Sharpness-Aware Minimization for Efficiently Improving Generalization." arXiv. https://doi.org/10.48550/arXiv.2010.01412.

Girshick, Ross. 2015. "Fast R-CNN." In IEEE International Conference on Computer Vision (ICCV). arXiv. https://doi.org/10.48550/arXiv.1504.08083.

Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.1311.2524.

He, Kaiming, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2021. "Masked Autoencoders Are Scalable Vision Learners." arXiv. https://doi.org/10.48550/arXiv.2111.06377.

He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. "Mask R-CNN." In IEEE International Conference on Computer Vision (ICCV). arXiv. https://doi.org/10.48550/arXiv.1703.06870.

Law, Hei, and Jia Deng. 2018. "CornerNet: Detecting Objects as Paired Keypoints." In European Conference on Computer Vision (ECCV). arXiv. https://doi.org/10.48550/arXiv.1808.01244.

Li, Feng, Hao Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, and Lei Zhang. 2022. "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.2203.01305.

Li, Yanyu, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. 2022. "EfficientFormer: Vision Transformers at MobileNet Speed." arXiv. https://doi.org/10.48550/arXiv.2206.01191.

Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. "Feature Pyramid Networks for Object Detection." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.1612.03144.

Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. "Focal Loss for Dense Object Detection." In IEEE International Conference on Computer Vision (ICCV). arXiv. https://doi.org/10.48550/arXiv.1708.02002.

Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2014. "Microsoft COCO: Common Objects in Context." In European Conference on Computer Vision (ECCV). https://arxiv.org/abs/1405.0312.

Liu, Shilong, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. 2022. "DAB-DETR: Dynamic Anchor Boxes Are Better Queries for DETR." In International Conference on Learning Representations (ICLR). arXiv. https://doi.org/10.48550/arXiv.2201.12329.

Liu, Shu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. 2018. "Path Aggregation Network for Instance Segmentation." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.1803.01534.

Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. "SSD: Single Shot MultiBox Detector." In European Conference on Computer Vision (ECCV), 9905:21-37. https://doi.org/10.1007/978-3-319-46448-0_2.

Liu, Ze, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, et al. 2022. "Swin Transformer V2: Scaling Up Capacity and Resolution." arXiv. https://doi.org/10.48550/arXiv.2111.09883.

Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows." In IEEE/CVF International Conference on Computer Vision (ICCV). arXiv. https://doi.org/10.48550/arXiv.2103.14030.

Liu, Zhuang, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. "A ConvNet for the 2020s." arXiv. https://doi.org/10.48550/arXiv.2201.03545.

Mehta, Sachin, and Mohammad Rastegari. 2022. "MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer." arXiv. https://doi.org/10.48550/arXiv.2110.02178.

Naseer, Muzammal, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. 2021. "Intriguing Properties of Vision Transformers." arXiv. https://doi.org/10.48550/arXiv.2105.10497.

Rao, Yongming, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2021. "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification." arXiv. https://doi.org/10.48550/arXiv.2106.02034.

Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. "You Only Look Once: Unified, Real-Time Object Detection." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.1506.02640.

Redmon, Joseph, and Ali Farhadi. 2018. "YOLOv3: An Incremental Improvement." arXiv. https://doi.org/10.48550/arXiv.1804.02767.

Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. 2015. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." In Advances in Neural Information Processing Systems (NeurIPS). arXiv. https://doi.org/10.48550/arXiv.1506.01497.

Shao, Rulin, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. 2022. "On the Adversarial Robustness of Vision Transformers." arXiv. https://doi.org/10.48550/arXiv.2103.15670.

Steiner, Andreas, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. 2022. "How to Train Your ViT? Data, Augmentation, and Regularization in Vision Transformers." arXiv. https://doi.org/10.48550/arXiv.2106.10270.

Tan, Mingxing, Ruoming Pang, and Quoc V. Le. 2020. "EfficientDet: Scalable and Efficient Object Detection." In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv. https://doi.org/10.48550/arXiv.1911.09070.

Tian, Zhi, Chunhua Shen, Hao Chen, and Tong He. 2019. "FCOS: Fully Convolutional One-Stage Object Detection." In IEEE International Conference on Computer Vision (ICCV). arXiv. https://doi.org/10.48550/arXiv.1904.01355.

Touvron, Hugo, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. "Training Data-Efficient Image Transformers & Distillation Through Attention." arXiv. https://doi.org/10.48550/arXiv.2012.12877.

Touvron, Hugo, Matthieu Cord, and Hervé Jégou. 2022. "DeiT III: Revenge of the ViT." arXiv. https://doi.org/10.48550/arXiv.2204.07118.

Tu, Zhengzhong, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. 2022. "MaxViT: Multi-Axis Vision Transformer." arXiv. https://doi.org/10.48550/arXiv.2204.01697.

Wang, Wenhai, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions." arXiv. https://doi.org/10.48550/arXiv.2102.12122.

Wu, Haiping, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. 2021. "CvT: Introducing Convolutions to Vision Transformers." arXiv. https://doi.org/10.48550/arXiv.2103.15808.

Xie, Zhenda, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. 2022. "SimMIM: A Simple Framework for Masked Image Modeling." arXiv. https://doi.org/10.48550/arXiv.2111.09886.

Yang, Jianwei, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. 2021. "Focal Self-Attention for Local-Global Interactions in Vision Transformers." arXiv. https://doi.org/10.48550/arXiv.2107.00641.

Yuan, Li, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. 2021. "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet." arXiv. https://doi.org/10.48550/arXiv.2101.11986.

Zhang, Hao, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. 2023. "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection." In International Conference on Learning Representations (ICLR). arXiv. https://doi.org/10.48550/arXiv.2203.03605.

Zhang, Zizhao, Han Zhang, Long Zhao, Ting Chen, Sercan O. Arik, and Tomas Pfister. 2021. "Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding." arXiv. https://doi.org/10.48550/arXiv.2105.12723.

Zhou, Xingyi, Dequan Wang, and Philipp Krähenbühl. 2019. "Objects as Points." arXiv. https://doi.org/10.48550/arXiv.1904.07850.

Zhu, Xizhou, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2021. "Deformable DETR: Deformable Transformers for End-to-End Object Detection." In International Conference on Learning Representations (ICLR). arXiv. https://doi.org/10.48550/arXiv.2010.04159.

Zong, Zhuofan, Guanglu Song, and Yu Liu. 2023. "DETRs with Collaborative Hybrid Assignments Training." In IEEE/CVF International Conference on Computer Vision (ICCV). arXiv. https://doi.org/10.48550/arXiv.2211.12860.

HOG descriptor parameters (Dalal and Triggs 2005): gradient magnitudes and unsigned orientations (0-180°) are accumulated into 9-bin histograms over dense 8 $\times$ 8-pixel cells; cells are grouped into overlapping 2 $\times$ 2-cell blocks (16 $\times$ 16 pixels), and each block is normalized by L2-Hys (L2-normalize, clip values at 0.2, renormalize). The concatenated block descriptors form a feature vector of length 9 bins $\times$ 4 cells/block $\times$ (block count per detection window); on the canonical 64 $\times$ 128-pixel pedestrian detection window, with 7 $\times$ 15 = 105 overlapping blocks, this gives a 3780-dimensional descriptor. Classification uses a linear SVM trained on the INRIA pedestrian dataset. The paper reports roughly 11% miss rate at $10^{-4}$ false positives per window on INRIA test, at the time significantly better than prior Haar-wavelet-based methods. The 8-pixel cell, 16-pixel block, and 9-bin choices each came out of the paper's systematic parameter study on INRIA; the paper notes that doubling cell size (to 16 $\times$ 16) drops performance substantially because fine gradient structure at pedestrian limb boundaries is lost. HOG was the dominant pedestrian and general object descriptor until R-CNN (2014) replaced it.↩︎
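The 3780 figure can be checked with a few lines of arithmetic. This is a minimal sketch, not the paper's code: the function name and its arguments are hypothetical, and it assumes a block stride of one cell (8 pixels), which the word "overlapping" implies but the footnote does not state.

```python
def hog_descriptor_length(win_w=64, win_h=128, cell=8,
                          block_cells=2, stride_cells=1, bins=9):
    """Length of a HOG descriptor for one detection window.

    Assumes square cells, square blocks, and a block stride given in cells
    (stride_cells=1 is an assumption, i.e. an 8-pixel stride by default).
    """
    cells_x = win_w // cell                      # 64 / 8  = 8 cells across
    cells_y = win_h // cell                      # 128 / 8 = 16 cells down
    blocks_x = (cells_x - block_cells) // stride_cells + 1   # 7
    blocks_y = (cells_y - block_cells) // stride_cells + 1   # 15
    values_per_block = block_cells * block_cells * bins      # 4 * 9 = 36
    return blocks_x * blocks_y * values_per_block


print(hog_descriptor_length())  # 105 blocks * 36 values = 3780
```

Under the same assumptions, 16-pixel cells give only 3 $\times$ 7 blocks per window, which shows how quickly the spatial sampling coarsens when the cell size doubles.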
Two-stage vs one-stage trade-off: two-stage detectors filter proposals before classification, gaining accuracy on hard examples at the cost of inference latency; one-stage detectors predict densely in a single forward pass, paying for speed with the foreground/background imbalance problem. The "RetinaNet matches Faster R-CNN" result was the first empirical point at which one-stage stopped trading accuracy for speed; the EfficientDet line later pushed the Pareto frontier further.↩︎
Cross-paper FPS incomparability: each detector paper reports throughput on its own hardware (K40, Titan X, V100, A100), test resolution, and batch size, with TensorRT/half-precision sometimes folded in. Within-family FPS comparisons are reliable; cross-family comparisons require re-benchmarking on a single platform, which the literature rarely provides.↩︎
ImageNet-pretrained backbone dependence: detection pipelines are not trained from scratch; the entire family inherits ImageNet-pretrained ResNet/Swin features. The DeCAF result (Donahue et al. 2014) was the first systematic demonstration that mid-layer activations of a supervised ImageNet CNN transfer as off-the-shelf features to downstream tasks, and that observation seeded the entire pretrain-then-finetune paradigm that detection inherits. Scratch training (Detectron2 GroupNorm recipe, 6 $\times$ schedule) closes the gap on COCO but at roughly 6 $\times$ the training compute, and self-supervised backbones (MoCo v3, MAE) now match supervised pretraining for downstream detection.↩︎
Scale variance: detection must localize objects spanning two orders of magnitude in pixel area within the same image (a 30-pixel pedestrian and a 300-pixel bus). Single-scale prediction misses one extreme; image pyramids cost N-fold compute. Addressed by feature pyramids (FPN, PANet, BiFPN), multi-scale anchors (Faster R-CNN, SSD), per-level FCOS regression ranges, and dynamic multi-scale training (YOLOv3, EfficientDet).↩︎
Anchor design: hand-tuned scale and aspect-ratio priors (Faster R-CNN's 3 $\times$ 3 = 9 anchors, SSD's per-level 4 or 6, RetinaNet's 3 $\times$ 3 across 5 levels). The choice transfers poorly across datasets and adds three hyperparameters per level. Anchor-free methods drop the priors: FCOS regresses box offsets from each pixel location, while CornerNet and CenterNet predict keypoint heatmaps (corners and centers, respectively); YOLOv3 keeps anchors but derives the priors via k-means clustering on the dataset.↩︎
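To make the "3 $\times$ 3 = 9 anchors" count concrete, the sketch below enumerates scale and aspect-ratio combinations at a single location. The function name is hypothetical, and the default values (base size 32, scales $2^0, 2^{1/3}, 2^{2/3}$, ratios 1:2, 1:1, 2:1) are RetinaNet-style settings assumed here for illustration; the footnote itself gives only the counts.

```python
import numpy as np


def make_anchors(base_size=32.0,
                 scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3)),
                 ratios=(0.5, 1.0, 2.0)):
    """Return a (len(scales) * len(ratios), 4) array of (x1, y1, x2, y2)
    boxes centred on the origin. Each ratio is height / width, and every
    anchor at a given scale has the same area."""
    anchors = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in ratios:
            w = np.sqrt(area / r)   # solve w * h = area with h = r * w
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)


anchors = make_anchors()
print(anchors.shape)  # (9, 4): 3 scales x 3 aspect ratios
```

Each of these boxes is then translated to every feature-map location on every pyramid level, which is where the per-level hyperparameter burden comes from.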
Label assignment ambiguity: which anchor (or per-pixel location) is responsible for which ground-truth box. Faster R-CNN's RPN uses fixed IoU thresholds (positive above 0.7, negative below 0.3); FCOS assigns a location covered by several overlapping GTs to the smallest-area box; ATSS adaptively picks the threshold per object via mean+std of candidate IoUs; OTA casts assignment as an optimal-transport problem and DETR as one-to-one bipartite (Hungarian) matching. The choice matters by 1-3 AP and remains an active research area.↩︎
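The difference between a fixed and an adaptive IoU threshold can be sketched as follows. Both functions are hypothetical simplifications: the fixed rule omits the RPN's extra "highest-IoU anchor is always positive" clause, and the ATSS-style rule omits candidate selection by center distance and the center-inside-box check, keeping only the mean+std threshold. The use of NumPy's population standard deviation is also an assumption.

```python
import numpy as np


def fixed_threshold_labels(max_ious, pos_thr=0.7, neg_thr=0.3):
    """Label each anchor from its best IoU with any ground-truth box:
    1 = positive, 0 = negative, -1 = ignored during training."""
    ious = np.asarray(max_ious, dtype=float)
    labels = np.full(ious.shape, -1, dtype=int)
    labels[ious < neg_thr] = 0
    labels[ious >= pos_thr] = 1
    return labels


def adaptive_threshold_mask(candidate_ious):
    """ATSS-style rule for ONE ground-truth box: candidates whose IoU
    reaches mean + std of the candidate IoUs become positives."""
    ious = np.asarray(candidate_ious, dtype=float)
    thr = ious.mean() + ious.std()
    return ious >= thr, thr


print(fixed_threshold_labels([0.1, 0.5, 0.8]))        # [ 0 -1  1]
mask, thr = adaptive_threshold_mask([0.1, 0.2, 0.3, 0.8])
print(mask, round(thr, 3))                            # only the 0.8 candidate passes
```

The adaptive rule needs no per-dataset threshold: an object whose candidates all overlap poorly still receives positives, because the threshold moves with the statistics of its own candidates.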
Foreground/background imbalance: a 600 $\times$ 800 image with FPN produces ~100k anchor locations against ~10 ground-truth objects; the resulting ~1:1000 ratio means easy negatives dominate gradients under plain cross-entropy. Two-stage detectors filter via the RPN to ~1:3; SSD uses 3:1 hard negative mining; RetinaNet introduced focal loss, which down-weights easy examples by the factor $(1-p_t)^\gamma$. The bias of the final classification convolution is initialized to $b = -\log((1-\pi)/\pi)$ with $\pi = 0.01$, so that every anchor initially predicts foreground with probability $\pi$, matching the rarity of foreground rather than starting at 0.5.↩︎
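A minimal NumPy sketch of the two ingredients named above, the $(1-p_t)^\gamma$ modulating factor and the prior-probability bias. The function names are hypothetical; the defaults $\gamma = 2$ and the class-balancing weight $\alpha = 0.25$ are the commonly used RetinaNet settings and are assumptions here, since the footnote states only the modulating factor and $\pi$.

```python
import numpy as np


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Per-example binary focal loss.

    p: predicted foreground probability, y: label in {0, 1}.
    p_t is the probability assigned to the true class, so (1 - p_t)
    is small for easy examples and the loss is scaled down accordingly.
    """
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)


def prior_bias(pi=0.01):
    """Bias b such that sigmoid(b) == pi."""
    return -np.log((1 - pi) / pi)


easy_neg = focal_loss(0.01, 0)   # confident, correct background
hard_neg = focal_loss(0.90, 0)   # background scored as foreground
print(easy_neg < hard_neg)       # True: the easy negative contributes far less
print(1 / (1 + np.exp(-prior_bias())))  # approximately 0.01
```

Setting `gamma=0` and `alpha=0.5` recovers half the ordinary cross-entropy, which makes it easy to see that the modulating factor is the only thing separating the two losses.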
NMS heuristic: hand-crafted greedy post-processing that suppresses overlapping detections by IoU. Threshold-sensitive, fails in dense crowds, and not differentiable. Soft-NMS, DIoU-NMS, and learned NMS soften it; DETR-style Hungarian matching removes it entirely by predicting a fixed-size set with one-to-one assignment.↩︎
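The greedy procedure can be sketched in a few lines of NumPy. This is an illustrative single-class version, not a specific library's implementation: the function name and the 0.5 default threshold are assumptions, and boxes are taken to be `(x1, y1, x2, y2)` with positive area.

```python
import numpy as np


def greedy_nms(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring box, drop every remaining box whose IoU
    with it exceeds iou_thr, then repeat on what is left.
    Returns the indices of the kept boxes in descending score order."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Intersection of box i with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thr]   # hard cut: the threshold sensitivity
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
print(greedy_nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```

The second box overlaps the first with IoU 81/119, about 0.68, so it is discarded even if it belonged to a distinct object; that hard cut is the crowd failure mode described above, and the `argsort` plus boolean indexing is why the step is not differentiable.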
Occlusion and crowding: NMS suppresses overlapping detections of the same class even when they correspond to distinct instances. Soft-NMS preserves recall in crowded scenes; CrowdDet and Repulsion Loss explicitly model crowd density. Set-prediction transformers handle occlusion by construction since each query is one-to-one matched to one GT.↩︎
Slow convergence (DETR): the original DETR required 500 COCO epochs (versus 12-24 for Faster R-CNN) because Hungarian-matching assignments are unstable early in training and the queries carry no explicit object priors. Deformable-DETR, DAB-DETR, and DN-DETR cut this to ~50 epochs by introducing, respectively, reference points with deformable attention, dynamic anchor-box queries, and denoising auxiliary tasks.↩︎
Long-tail class imbalance: COCO's 80 classes are comparatively balanced, but production data and LVIS exhibit power-law distributions. Detector heads tuned under near-balanced class frequencies underperform on rare classes; recipes like Equalization Loss, Seesaw Loss, and decoupled classifier finetuning address this orthogonally to the foreground/background problem.↩︎