Multimodal LLMs: LLaVA, Flamingo, GPT-4V

A language model is a sequence-in, sequence-out function. The cleanest way to add vision is to turn images into more tokens, then let the language model do what it already does. That sentence describes 95% of open multimodal models shipped since 2023. The interesting design choices are: which vision encoder, how to project image features into the LLM's token space, whether to interleave images with text or keep them separate, and what training data and recipe turn this Frankenstein into something that actually follows instructions about images.

The dominant recipe

LLaVA (Liu et al, April 2023) crystallised the playbook:

Vision encoder. A frozen pretrained CLIP-style ViT (CLIP ViT-L/14 in the original LLaVA). Produces a grid of patch features (e.g. 24x24 = 576 tokens at 336x336 input).
Projector. A small MLP (originally one linear layer, later a two-layer MLP) maps the vision feature dim to the LLM's token embedding dim.
LLM. A pretrained instruction-tuned LLM (Vicuna in the original, Llama / Mistral / Qwen in later forks). Frozen at first, fine-tuned later.

At inference, prepend the projected image tokens to the text tokens and feed the lot into the LLM. The model treats image tokens as if they were text - same attention, same position encoding, same loss.

Variant	Vision encoder	Projector	LLM	Training
LLaVA-1.5	CLIP ViT-L	2-layer MLP	Vicuna 7B/13B	558k pretrain + 665k instruct
Idefics2	SigLIP-SO-400M	Perceiver resampler	Mistral 7B	Web-scale interleaved
Qwen2-VL	DFN-CLIP variant	MLP + dynamic resolution	Qwen2 LLM	Multi-stage, Chinese + English
InternVL 2/3	InternViT	MLP	Multiple LLM backbones	Multi-stage, OCR-heavy

The recipe is so standard that adapting a new LLM to vision is now a long weekend's work, not a research project.

Why LLaVA shipped quickly

LLaVA's contribution was less the architecture (others had stitched encoders to LLMs before) and more the data. The team used GPT-4 (text-only) to generate visual instruction-tuning data: feed it image captions and bounding boxes from COCO, ask it to write plausible Q&A and reasoning chains, then train the visual model to produce those outputs given the actual image. 158k synthetic instruction-following examples, trained in a day on 8 A100s.

Two consequences:

The training recipe is reproducible. Anyone with a small GPU budget and an LLM API can generate equivalent data. The open multimodal community has been iterating on this loop ever since.
Instruction-following capability transferred almost for free. Because the LLM was already instruction-tuned and the projector is small, only the projector and a low-rank LLM adapter need to learn how to consume images. The world knowledge in the LLM is intact.

The dominant recipe

Why LLaVA shipped quickly

Keep reading with Pro.