Vision Transformers (ViT)
How treating an image as a sequence of patches let pure transformers beat CNNs once data crossed the 300M-image mark, and what the architecture gave up to get there.
A transformer has no built-in notion of locality, translation invariance, or 2D structure. The orthodox 2017-2020 view was that these could not be learned efficiently from data, which is why CNNs ruled vision. Dosovitskiy et al.'s 2020 ViT paper showed the orthodox view was wrong - or rather, right only in the small-data regime. Given enough pretraining images, a plain transformer beat the best CNNs of the time on ImageNet. The story is really about inductive bias as a data-efficiency trade.
Image to patches
The whole architectural change fits on one slide:
- Take a 224x224x3 image. Cut it into non-overlapping patches of 16x16 pixels. That gives a 14x14 grid, so 196 patches.
- Flatten each patch to a 768-dim vector (`16 * 16 * 3 = 768`).
- Linearly project to the model dimension (also 768 for ViT-Base).
- Prepend a learnable `[CLS]` token, add positional embeddings, run through a vanilla transformer encoder.
- The final `[CLS]` representation goes through a classification head.
That is it. No convolutions, no pooling, no spatial pyramid. Patches are tokens; an image is a 197-token sequence (196 patches + CLS).
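The steps up to the encoder input can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: `patchify` and `embed` are hypothetical helper names, and the random arrays stand in for weights that would be learned in a real model.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def embed(image, proj, cls_token, pos_embed, patch=16):
    """Patch tokens -> linear projection -> prepend [CLS] -> add positions."""
    tokens = patchify(image, patch) @ proj                  # (N, D)
    tokens = np.concatenate([cls_token, tokens], axis=0)    # (N + 1, D)
    return tokens + pos_embed

rng = np.random.default_rng(0)
dim = 768
image = rng.standard_normal((224, 224, 3))
proj = rng.standard_normal((16 * 16 * 3, dim)) * 0.02   # stand-in for the learned projection
cls_token = np.zeros((1, dim))                          # stand-in for the learned [CLS] token
pos_embed = rng.standard_normal((197, dim)) * 0.02      # stand-in for learned position embeddings

patches = patchify(image)
z0 = embed(image, proj, cls_token, pos_embed)
print(patches.shape, z0.shape)  # (196, 768) (197, 768)
```

Everything after this point is a standard transformer encoder operating on the 197-row matrix.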
Positional embeddings for 2D
Self-attention on its own is permutation-equivariant - shuffle the input tokens and the outputs are shuffled the same way, so nothing in the model knows where a patch came from. Vision needs spatial structure, so position has to be injected. ViT defaults to learnable 1D positional embeddings, one per patch index, added to the patch embeddings:
z_0 = [x_cls; E*p_1; E*p_2; ...; E*p_N] + E_pos
Counterintuitively, the authors found that 2D-aware embeddings (separate row/column embeddings, or sinusoidal grids) gave no measurable improvement over the flat 1D learned version. The model learns the 2D structure on its own from data. Variable-resolution inference is handled by bilinearly interpolating the embedding grid - awkward but it works.
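To make the interpolation step concrete, here is a rough sketch of resizing a learned position grid for a different input resolution. The function name `resize_pos_embed` is hypothetical, the corner-aligned sampling is an assumption made for simplicity, and real libraries may use a different alignment or a bicubic kernel.

```python
import numpy as np

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize patch position embeddings; the [CLS] slot is kept as-is.

    pos_embed has shape (1 + old_grid**2, D), with row 0 the [CLS] position.
    """
    cls_pos, grid = pos_embed[:1], pos_embed[1:]
    dim = grid.shape[1]
    grid = grid.reshape(old_grid, old_grid, dim)

    # Sample coordinates in the old grid (assumption: corners map to corners).
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows first, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return np.concatenate([cls_pos, out.reshape(new_grid * new_grid, dim)], axis=0)

rng = np.random.default_rng(0)
pos_embed = rng.standard_normal((1 + 14 * 14, 768))    # stand-in for embeddings trained at 224x224
resized = resize_pos_embed(pos_embed, 14, 24)          # 384x384 input -> 24x24 patch grid
print(resized.shape)  # (577, 768)
```

The patch projection and encoder weights are reused unchanged; only the position table is stretched to the new grid.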
Later architectures introduced spatial priors back in:
- RoPE-2D (used in many recent multimodal models) extends rotary embeddings to two axes.
- Conditional positional encoding generates position information from the local patches themselves.
Why ViT needed JFT-300M
On ImageNet-1k (1.3M images) ViT loses to ResNets at every model size. On ImageNet-21k (14M) it is competitive. On Google's internal JFT-300M (300M images) ViT-Huge beats the best CNN by about a point while pretraining in roughly a quarter of the compute. The pattern is unambiguous (ImageNet top-1 accuracy, %):
| Pretraining data | Best CNN (BiT) | Best ViT |
|---|---|---|
| ImageNet-1k (1.3M) | 76.5 | 73.5 |
| ImageNet-21k (14M) | 83.5 | 84.0 |
| JFT-300M | 87.5 | 88.6 |
The convolution's translation equivariance and locality are priors. A prior helps when data is scarce. With enough data, the prior becomes dead weight - the model could learn translation equivariance from examples if it wanted to, and might prefer something better. ViT's hyperparameters are also more uniform across scales than the CNNs' bag of tricks, which made the scaling story cleaner.
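A toy check of what "built in" means here: a convolution commutes with a shift of its input by construction, while an unconstrained linear layer does not and would have to learn that behaviour from data. This is a 1D illustration with wrap-around padding, chosen only to keep the example short; `circular_conv` is a hypothetical helper.

```python
import numpy as np

def circular_conv(x, kernel):
    """1D convolution with wrap-around padding: the same weights at every position."""
    n, k = len(x), len(kernel)
    return np.array([sum(kernel[j] * x[(i + j) % n] for j in range(k)) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
kernel = rng.standard_normal(3)
dense = rng.standard_normal((8, 8))   # an unconstrained linear layer, for contrast

shift = 2
# Shift-then-convolve equals convolve-then-shift.
conv_equivariant = np.allclose(
    circular_conv(np.roll(x, shift), kernel),
    np.roll(circular_conv(x, kernel), shift),
)
# The same property does not hold for random dense weights.
dense_equivariant = np.allclose(dense @ np.roll(x, shift), np.roll(dense @ x, shift))
print(conv_equivariant, dense_equivariant)  # True False
```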
Swin and the hybrid revival
A pure ViT computes attention over every token at every layer. That is fine at 224x224, but the cost is quadratic in the number of patches - a 1024x1024 input has 4096 patches and about 16.8M attention entries per head per layer. Dense prediction tasks (detection, segmentation) need that resolution.
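The arithmetic is easy to reproduce. The helper below is a hypothetical name; it assumes square inputs and 16x16 patches, and ignores the `[CLS]` token.

```python
def attention_entries(image_size, patch=16):
    """Patch count and attention-matrix entries (per head, per layer) for a square image."""
    tokens = (image_size // patch) ** 2
    return tokens, tokens * tokens

for size in (224, 512, 1024):
    tokens, entries = attention_entries(size)
    print(f"{size}x{size}: {tokens} patches, {entries:,} attention entries")
# 224x224: 196 patches, 38,416 attention entries
# 512x512: 1024 patches, 1,048,576 attention entries
# 1024x1024: 4096 patches, 16,777,216 attention entries
```

Going from 224 to 1024 pixels per side multiplies the patch count by about 21 and the attention matrix by about 437.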
Swin Transformer (Liu et al, 2021) reintroduces locality:
- Compute attention within fixed windows (e.g. 7x7 patches).
- Shift the windows between successive layers so information crosses window boundaries.
- Build a hierarchical pyramid - patches merge at each stage, halving resolution and doubling channels, like a CNN.
The result is linear in the number of patches, scales to high resolutions, and produces the multi-scale features that detection heads expect. Most modern detection and segmentation backbones (Swin v2, EVA-02, and ConvNeXt as the CNN response) sit somewhere on the conv-transformer spectrum.
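The windowing and shifting steps can be sketched with plain array reshapes. This is a simplified illustration under stated assumptions: the 56x56 grid and the helper names are chosen for the example, the shift is implemented as a cyclic roll by half a window, and the attention mask that a full implementation needs to stop wrapped-around tokens from attending to each other is omitted.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) token grid into (num_windows, window*window, C)."""
    h, w, c = x.shape
    assert h % window == 0 and w % window == 0, "grid must tile evenly"
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)

def shifted_window_partition(x, window):
    """Cyclically shift the grid by half a window, then partition as usual."""
    shift = window // 2
    return window_partition(np.roll(x, (-shift, -shift), axis=(0, 1)), window)

grid = np.arange(56 * 56).reshape(56, 56, 1)   # token ids on a 56x56 grid
regular = window_partition(grid, 7)
shifted = shifted_window_partition(grid, 7)
print(regular.shape)  # (64, 49, 1)

# Attention entries per head: 64 windows of 49 tokens vs global attention over 3136 tokens.
n = 56 * 56
print(64 * 49 * 49, n * n)  # 153664 9834496
```

Each shifted window straddles four of the regular windows, which is how information crosses window boundaries from one layer to the next.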
What ViT-based models won and lost
Won:
- Large-scale image classification. Pure ViTs at JFT or LAION scale beat everything.
- Multimodal alignment. CLIP and SigLIP use ViT image encoders, and frontier vision-language systems such as Gemini and GPT-4V are widely assumed to be ViT-derived, although their architectures are not fully public. The token sequence interface drops cleanly into a language model.
- Self-supervised pretraining. MAE (masked autoencoders) and DINOv2 both prefer ViT backbones; the patch tokenisation is a natural masking unit.
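As a rough illustration of why patches make a natural masking unit, MAE-style pretraining hides a random subset of patch tokens (75% in the MAE paper) and feeds only the visible ones to the encoder. The function below is a hypothetical sketch of that selection step only; the decoder and reconstruction loss are left out.

```python
import numpy as np

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens; return them plus a boolean mask."""
    if rng is None:
        rng = np.random.default_rng()
    n = tokens.shape[0]
    keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False            # True marks a hidden patch
    return tokens[keep_idx], mask

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 768))     # stand-in for embedded patches
visible, mask = random_mask(tokens, 0.75, rng)
print(visible.shape, int(mask.sum()))  # (49, 768) 147
```

Because the encoder only processes the 49 visible tokens, the expensive part of pretraining runs on a quarter of the sequence.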
Lost (or at least conceded):
- Small-data classification. Below roughly 10M pretraining images, a ResNet or ConvNeXt with strong augmentation matches or beats ViTs at a fraction of the compute.
- Edge inference. MobileNets and EfficientNets still dominate phone-scale deployment because depthwise convolutions are several times cheaper than attention at low resolution (5-10x is a rough figure that depends on hardware and model size).
- Dense prediction without a hybrid. Pure ViT does not naturally produce multi-scale features. Detection and segmentation models almost always reintroduce some hierarchy, either in the backbone (Swin, ConvNeXt) or as a feature pyramid built on top of a plain ViT (ViTDet).
The inductive-bias trade-off
The lesson generalises beyond vision:
- A strong architectural prior is a substitute for data. When data is scarce, build the prior in.
- A weak prior plus enough data can match or exceed the strong prior, because the model learns the right prior rather than the assumed one.
- Compute is the wildcard. Strong priors are usually cheaper to train. The break-even data scale where weak-prior architectures win is moving downward as compute gets cheaper.
CNNs did not die. They migrated to where their assumptions still hold: small models, small data, edge silicon. The frontier moved up the data curve to where the assumptions are constraints rather than gifts.
Where it falls down
- Quadratic attention at high resolution. Without windowed or sparse variants, a 1024x1024 image is uncomfortable on a single GPU.
- Sample efficiency. A ViT trained from scratch on a 5,000-image dataset overfits where a small CNN would not.
- Positional embedding extrapolation. Test-time resolutions far from training need bilinear interpolation of the position grid, which loses accuracy.
Further reading
- An Image is Worth 16x16 Words - the original ViT paper, Dosovitskiy et al 2020.
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows - Liu et al 2021, the dominant hybrid design.
- An Image is Worth 16x16 Words, What is a Video Worth? - the video extension that motivated patch-time tokenisation in later video models.