Contrastive Vision-Language: CLIP

Before CLIP, image classification was a labelled-dataset problem. You picked 1,000 ImageNet classes, you trained a model, and the model knew those 1,000 things. Radford et al.'s 2021 CLIP paper reframed the task: instead of predicting one of N labels, predict which caption goes with which image. The supervisory signal becomes natural-language captions scraped from the web - effectively unlimited - and the resulting model can recognise arbitrary concepts described in English at test time. That switch from closed-vocabulary classification to open-vocabulary alignment is arguably the most important idea in vision since ResNet, and it is a change of objective and supervision rather than of architecture.

The contrastive objective

Train two encoders side by side:

An image encoder f (a ViT or ResNet) producing one fixed-size embedding per image (512-dim for ViT-B/32, 768-dim for ViT-L/14).
A text encoder g (a transformer) producing an embedding of the same dimension per caption.

Within a batch of N image-text pairs, compute the N x N matrix of cosine similarities. The diagonal entries are the true matches; the off-diagonals are negatives. Apply a symmetric cross-entropy loss over rows and columns:

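import torch, torch.nn.functional as F
img = F.normalize(f(I), dim=-1)              # N x d, unit length
txt = F.normalize(g(T), dim=-1)              # N x d, unit length
logits = img @ txt.T / temperature           # N x N scaled cosine similarities
labels = torch.arange(N)                     # true match for row i is column i
loss_i = F.cross_entropy(logits, labels)     # image-to-text
loss_t = F.cross_entropy(logits.T, labels)   # text-to-image
loss = (loss_i + loss_t) / 2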

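The temperature is a learned parameter. The paper initialises it at 0.07 and clips the logit scale at 100, so the temperature cannot drop below 0.01; in practice it typically settles around that floor. Batch size matters enormously - larger batches give more negatives, which makes the embedding space more discriminative. The original CLIP used a batch size of 32,768; the largest ViT was trained on 256 V100s.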
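To make the objective concrete, here is a minimal dependency-free sketch of the symmetric loss on toy data. The 3-dim embeddings are invented for illustration, and the default temperature of 0.07 is only the paper's initial value, not a trained one. The code checks that correctly paired captions score a lower loss than captions shifted onto the wrong images.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(row, target):
    """Softmax cross-entropy for one row of logits, computed stably."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return log_sum_exp - row[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine(image_i, text_j) / temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text: row i should select column i.
    loss_i = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: transpose, then column j should select row j.
    columns = [list(col) for col in zip(*logits)]
    loss_t = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (loss_i + loss_t) / 2

# Toy batch (illustrative values only).
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
aligned = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # every caption on the wrong image

print("aligned :", clip_loss(images, aligned))
print("shuffled:", clip_loss(images, shuffled))
```

If every embedding in the batch is identical, all logits are equal and the loss is log N, which is the chance-level baseline the model has to beat.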

Why a shared embedding space is useful

The geometry that falls out is the whole product:

Zero-shot classification. To classify an image into 1,000 ImageNet categories, embed all 1,000 class names with the text encoder, embed the image with the image encoder, and pick the nearest text. No fine-tuning, no labelled examples. CLIP ViT-L/14 at 336px resolution hits 76.2% top-1 on ImageNet zero-shot, matching the original fully supervised ResNet-50.
Cross-modal retrieval. Embed a corpus of images; query with text; rank by cosine similarity. The same works with a text corpus and an image query. Much production image search is some descendant of this.
Grounding for generation. Stable Diffusion conditions on CLIP text embeddings. Open-vocabulary detectors (OWL-ViT, GLIP) condition on them. Any system that needs to consume natural-language descriptions of visual concepts plugs into CLIP space.
Compositional probing. Embed pairs of concepts and look at vector arithmetic. The space is not perfectly compositional (see below) but it is structured enough to be useful.
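Zero-shot classification and retrieval reduce to the same operation: rank candidates by cosine similarity to a query in the shared space. The sketch below assumes the embeddings have already been computed by the two encoders. The 3-dim vectors and prompt strings are made up for illustration and are not real CLIP outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return candidate keys sorted from most to least similar to the query."""
    return sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)

# Hypothetical text embeddings for three class prompts (toy values).
class_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Hypothetical embedding of the image to classify (toy values).
image_embedding = [0.8, 0.3, 0.1]

ranking = rank_by_similarity(image_embedding, class_embeddings)
predicted = ranking[0]  # zero-shot prediction = nearest class prompt
print(predicted)
```

For retrieval, swap the roles: pass a text embedding as the query and a dictionary of image embeddings as the candidates. At real corpus sizes the exhaustive sort would normally give way to an approximate nearest-neighbour index.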

The 400M-pair WebImageText recipe

OpenAI built WIT (WebImageText), a 400M image-text pair dataset, by searching for pairs whose text matches one of 500,000 queries and balancing the result by query. The query list was built from words, bi-grams and article titles in English Wikipedia, plus WordNet synsets. The selection was the secret sauce - dataset quality matters more than scale beyond a certain point. Key choices:

Pairs not triplets. Image, caption. No human labels, no class taxonomy.
Query balancing. Cap the number of pairs per query (up to 20,000 in the paper) so popular concepts do not swamp rare ones.
Filtering for English. Multilingual CLIP variants came later (mCLIP, SigLIP-multilingual).
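The query-balancing step can be sketched in a few lines. This is an assumption-laden illustration, not OpenAI's pipeline: the (query, image_url, caption) tuple layout and the file names are hypothetical, and keeping the first `cap` pairs in input order is a simplification (a real pipeline would more likely sample).

```python
from collections import defaultdict

def balance_by_query(pairs, cap):
    """Keep at most `cap` (image, caption) pairs per search query, in input order."""
    kept = []
    counts = defaultdict(int)
    for query, image_url, caption in pairs:
        if counts[query] < cap:
            counts[query] += 1
            kept.append((query, image_url, caption))
    return kept

# Toy crawl: a popular query with many hits and a rare query with one.
raw = [("dog", f"img_dog_{i}.jpg", "a dog") for i in range(5)]
raw += [("axolotl", "img_axolotl_0.jpg", "an axolotl in a tank")]

balanced = balance_by_query(raw, cap=2)
for row in balanced:
    print(row)
```

With a cap of 2, the popular query shrinks from five pairs to two while the rare query keeps its single pair, so rare concepts make up a larger share of what remains.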

The paper reports 12 days on 256 V100s for the largest ViT and 18 days on 592 V100s for the largest ResNet (RN50x64). Replicating that at home is difficult.

OpenCLIP, SigLIP, EVA-CLIP

OpenAI shipped weights but neither the training data nor the training code. The follow-ups closed both gaps and then improved on the recipe:

OpenCLIP (LAION, mlfoundations) reproduced CLIP using the public LAION-400M and LAION-2B datasets, hitting 78% ImageNet zero-shot with ViT-H/14. The first fully open CLIP at frontier quality.
EVA-CLIP (Sun et al., 2023) added masked image modelling pretraining for the image encoder, better init for the text encoder, and stronger augmentation. ViT-E/14 hit 82% zero-shot ImageNet.
SigLIP (Zhai et al., 2023) replaced the softmax contrastive loss with a per-pair sigmoid loss. Each pair is judged independently as positive or negative, with no batch-wide normalisation. This largely decouples the loss from batch size, so it trains better at smaller batches.

SigLIP is now a common default vision encoder for open multimodal models (PaliGemma, Idefics2, the vision tower of Gemma 3).
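The per-pair sigmoid loss is easy to contrast with the softmax version in code. In this minimal sketch each cell of the N x N similarity matrix is its own binary classification, with label +1 on the diagonal and -1 elsewhere. Embeddings are assumed to be unit-normalised already. The scale of 10 and bias of -10 mirror the initial values described in the SigLIP paper, where both are learned; treat them as assumptions here.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pair_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Per-pair sigmoid loss: every (image, text) cell is an independent binary problem."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(image_embs[i], text_embs[j]))
            label = 1.0 if i == j else -1.0  # +1 for true pairs, -1 for negatives
            total += -log_sigmoid(label * (scale * sim + bias))
    return total / n  # summed over all cells, averaged over the batch size

# Toy unit-norm embeddings (illustrative values only).
images = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

print("matched:", sigmoid_pair_loss(images, matched))
print("swapped:", sigmoid_pair_loss(images, swapped))
```

No term is normalised against the rest of its row or column, so a cell's contribution does not depend on which other pairs share the batch. The large negative bias offsets the heavy imbalance between the N positives and the N^2 - N negatives at the start of training.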

What zero-shot accuracy hides

A CLIP ViT-L scoring 76% on ImageNet zero-shot looks like supervised ImageNet performance. The number obscures three things:

Matching accuracy does not mean matching failure modes. CLIP is brittle to typographic attacks (a sticker reading "iPod" on an apple makes it predict iPod), and accuracy degrades sharply on distribution shifts beyond the natural-image regime.
Class-name engineering matters. In the original paper, "A photo of a cat" beats "cat" by about 1.3 points on ImageNet; prompt ensembles ("a photo of a {}", "a sketch of a {}", "an art of a {}") add roughly another 3.5 points, for almost 5 in total. Reported numbers usually include this.
Compositional weaknesses. CLIP struggles with relations and binding - it cannot reliably distinguish "a red cube on a blue sphere" from "a blue cube on a red sphere." This is why downstream multimodal LLMs do not simply read off CLIP features; they pipe them through an LLM that can reason compositionally.
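Prompt ensembling, as used for the reported zero-shot numbers, averages in embedding space rather than averaging predictions. The sketch below normalises each filled-in template's embedding, takes the mean, then renormalises. `toy_embed_text` is a hypothetical stand-in for a real text encoder and exists only so the example runs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ensemble_class_embedding(class_name, templates, embed_text):
    """Average the unit-norm embeddings of every filled-in template, then renormalise."""
    embs = [l2_normalize(embed_text(t.format(class_name))) for t in templates]
    mean = [sum(col) / len(embs) for col in zip(*embs)]
    return l2_normalize(mean)

def toy_embed_text(text):
    """Hypothetical encoder stand-in: letter counts plus 1, so the vector is never zero."""
    return [text.count(ch) + 1.0 for ch in "aeo"]

templates = ["a photo of a {}", "a sketch of a {}", "an art of a {}"]
cat = ensemble_class_embedding("cat", templates, toy_embed_text)
print(cat)
```

The result is a single vector per class, so classification costs the same at inference time as it does with one prompt; the extra text-encoder passes happen once, up front.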

Where it falls down

Fine-grained discrimination. Bird species, plant taxonomy, medical imagery - the long tail of visual concepts that web captions never describe precisely.
OCR and text in images. Standard CLIP can read short, prominent text (which is what makes typographic attacks work) but is unreliable on small or dense text. SigLIP 2 and PaliGemma do better; specialised OCR models still beat both.
Counting and spatial reasoning. CLIP cannot reliably say there are three apples or that the apple is left of the orange. This looks like a limitation of the alignment objective, which compresses each image and each caption into a single global vector, rather than purely a data problem.