Supervised Fine-Tuning for Instructions

The original GPT-3 could write poetry, debug code, and translate languages, yet a direct instruction like "summarise this document in three bullet points" would often produce more prose, alternative phrasings of the request, or a continuation of the document. The model was not broken; it had simply never been trained to treat a prompt as a command. Supervised fine-tuning (SFT) on instructions is the targeted intervention that closes that gap.

What SFT actually does

Pretraining optimises a single objective: predict the next token given all previous tokens. The training corpus is a near-random slice of the internet, so the model learns to continue text in whatever register it finds itself in. A question looks like the start of an FAQ thread, and the model happily continues the thread. It has no signal that "question" should map to "answer."

SFT repurposes the same cross-entropy loss, but on a curated dataset of (instruction, response) pairs. The input format is typically a structured prompt template:

### Instruction
Summarise the following in three bullet points.

### Input
<document text>

### Response
<target response>

During fine-tuning, loss is computed only on the response tokens; the instruction and input tokens are masked. This is the key detail: the model is not rewarded for predicting the instruction perfectly, only for generating the correct response given that instruction. After a few thousand gradient steps on a few thousand such examples, the model has internalised a conditional distribution: "given that someone is asking me something, produce a helpful answer."
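The masking described above can be sketched in a few lines of plain Python. This is an illustration only, not how any particular library implements it: the function name `masked_nll` and the toy log-probabilities are made up for the example.

```python
import math


def masked_nll(token_log_probs, loss_mask):
    """Average negative log-likelihood over positions where loss_mask is 1."""
    if len(token_log_probs) != len(loss_mask):
        raise ValueError("token_log_probs and loss_mask must have the same length")
    kept = [-lp for lp, m in zip(token_log_probs, loss_mask) if m == 1]
    if not kept:
        raise ValueError("loss_mask selects no tokens")
    return sum(kept) / len(kept)


# Toy sequence: 3 instruction tokens followed by 2 response tokens.
# The probabilities are invented for illustration.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.05), math.log(0.5), math.log(0.25)]
mask = [0, 0, 0, 1, 1]

loss = masked_nll(log_probs, mask)

# Changing how well the instruction tokens are predicted leaves the loss untouched.
loss_other_prompt = masked_nll([math.log(0.9)] * 3 + log_probs[3:], mask)
```

Both calls return the same value, because only the two response positions are averaged.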

The compute cost is modest compared to pretraining. A 7-billion-parameter model whose pretraining cost millions of dollars can be instruction-tuned in hours on a single 8-GPU node.
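A back-of-envelope check makes the "hours" figure plausible. The sketch below uses the common approximation that training costs about 6 FLOPs per parameter per token. Every input here is an assumption chosen for illustration (dataset size, average example length, and especially the sustained per-GPU throughput), so treat the result as an order-of-magnitude estimate.

```python
def estimate_sft_hours(n_params, n_examples, tokens_per_example, epochs,
                       n_gpus, sustained_flops_per_gpu):
    """Rough wall-clock hours, using training FLOPs ~= 6 * params * tokens."""
    total_tokens = n_examples * tokens_per_example * epochs
    total_flops = 6 * n_params * total_tokens
    seconds = total_flops / (n_gpus * sustained_flops_per_gpu)
    return seconds / 3600


hours = estimate_sft_hours(
    n_params=7e9,
    n_examples=50_000,              # assumed dataset size
    tokens_per_example=512,         # assumed average example length
    epochs=3,
    n_gpus=8,
    sustained_flops_per_gpu=1e14,   # assumed 100 TFLOP/s actually achieved
)
```

Under these assumptions the estimate comes out at roughly an hour. Lower hardware utilisation or longer sequences push it up, but it stays far below pretraining budgets.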

The data problem is the real problem

The objective is trivial; assembling good data is not. Two properties dominate quality:

Coverage. Instructions should span a wide range of task types: summarisation, classification, translation, code generation, question answering, creative writing, factual recall, refusals. A dataset heavy in one category produces a model that defaults to that category when it is uncertain.
Response quality. Because the model is supervised, bad responses produce bad models. Labellers with genuine subject-matter expertise are expensive, and crowdsourced annotations frequently introduce subtle errors that compound at scale.
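A coverage check can be as simple as counting task types before training. The sketch below assumes each example carries a hypothetical `task_type` field; real datasets often lack such a label and need one assigned first. The 40% threshold is arbitrary.

```python
from collections import Counter


def task_shares(examples):
    """Return each task type's share of the dataset, largest first."""
    counts = Counter(ex["task_type"] for ex in examples)
    total = sum(counts.values())
    return [(task, n / total) for task, n in counts.most_common()]


def dominant_tasks(examples, max_share=0.4):
    """List task types whose share exceeds max_share."""
    return [task for task, share in task_shares(examples) if share > max_share]


# Toy dataset: heavily skewed towards summarisation.
toy = (
    [{"task_type": "summarisation"}] * 6
    + [{"task_type": "translation"}] * 2
    + [{"task_type": "code"}] * 2
)
flagged = dominant_tasks(toy)
```

Here `flagged` contains only the summarisation category, which makes up 60% of the toy data.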

The LIMA paper (Zhou et al., 2023) demonstrated that 1,000 carefully chosen high-quality examples could rival models trained on orders of magnitude more data. The practical implication is blunt: 10,000 mediocre examples are probably worse than 1,000 excellent ones.

Data sourcing strategies range across a spectrum:

Strategy	Example	Tradeoff
Human demonstration	InstructGPT's SFT stage	Highest quality, very expensive
Human-curated public NLP tasks	FLAN (Wei et al., 2022)	Broad coverage, rigid format
Distillation from a stronger model	Alpaca, Self-Instruct	Cheap, risk of inherited errors
Hybrid (seed human + model expansion)	OpenHermes, Dolphin	Practical middle ground

Self-Instruct (Wang et al., 2023) showed that a model can bootstrap its own instruction data: generate candidate instructions, filter duplicates and low-quality outputs, then fine-tune on what remains. This reduces the human bottleneck but does not eliminate it; the seed set and filter heuristics still require human judgement.
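The filtering step can be sketched as follows. This is a simplified stand-in, not the paper's pipeline: plain Jaccard word overlap replaces the paper's overlap metric, the length check is a crude placeholder for quality heuristics, and the thresholds and function names are invented for the example.

```python
import re


def token_set(text):
    """Lowercased word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_candidates(seed, candidates, max_similarity=0.7, min_words=3):
    """Keep candidates that are long enough and not near-duplicates of the pool."""
    pool = list(seed)
    accepted = []
    for cand in candidates:
        if len(cand.split()) < min_words:
            continue  # crude low-quality heuristic
        if any(jaccard(cand, kept) >= max_similarity for kept in pool):
            continue  # too similar to something already in the pool
        pool.append(cand)
        accepted.append(cand)
    return accepted


seed = ["Translate the sentence into French."]
candidates = [
    "Translate the sentence into French.",         # exact duplicate of the seed
    "Translate the sentence into French please.",  # near duplicate
    "Summarise",                                   # too short
    "Write a haiku about autumn rain.",
]
accepted = filter_candidates(seed, candidates)
```

Only the haiku instruction survives. The human judgement the text mentions lives in the seed list and in the choice of `max_similarity` and `min_words`.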

The mechanics of fine-tuning

In practice, SFT is standard full-parameter fine-tuning or, more commonly now, parameter-efficient fine-tuning (PEFT) via LoRA. The key hyperparameters are:

Learning rate. Typically 1e-5 to 2e-5 for full-parameter fine-tuning, substantially lower than pretraining; LoRA runs commonly use higher rates, on the order of 1e-4. Too high and the model catastrophically forgets its pretrained knowledge; too low and the instruction format never sticks.
Epochs. Usually 2 to 3 passes over the instruction dataset. Beyond that, overfitting to the training formats becomes visible: the model starts echoing the exact sentence structures from training examples.
Sequence length. Instruction datasets often include long documents. Truncating context at 2,048 tokens versus 8,192 tokens changes which tasks the model can handle.
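Before settling on a sequence length, it helps to measure how much of the dataset a given cutoff would truncate. A minimal sketch, assuming token counts per example have already been computed (the lengths below are invented):

```python
def truncation_report(lengths, max_seq_len):
    """Return (fraction of examples truncated, fraction of all tokens lost)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    over = [n for n in lengths if n > max_seq_len]
    lost = sum(n - max_seq_len for n in over)
    return len(over) / len(lengths), lost / sum(lengths)


lengths = [300, 1_500, 2_500, 6_000, 9_000]  # made-up token counts

frac_2k, lost_2k = truncation_report(lengths, 2_048)
frac_8k, lost_8k = truncation_report(lengths, 8_192)
```

On this toy data a 2,048-token cutoff truncates three of five examples, while 8,192 truncates one. Since responses sit at the end of each sequence, truncated examples may lose exactly the tokens that carry the loss.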

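A minimal training setup using the Hugging Face Trainer (which runs the PyTorch training loop internally) looks like:

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The Llama 2 tokenizer ships without a pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

# instruction_dataset is assumed to be built elsewhere: each item holds
# input_ids, attention_mask, and labels, with labels set to -100 on
# instruction tokens so that only response tokens are scored.

training_args = TrainingArguments(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # 4 x 8 = 32 sequences per device per optimiser step
    learning_rate=2e-5,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=instruction_dataset,  # pre-tokenised with labels masked
    # Pads each batch and fills padded label positions with -100.
    data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True, label_pad_token_id=-100),
)
trainer.train()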

The dataset's labels tensor should be -100 (the PyTorch cross-entropy ignore index) for all instruction and padding tokens, so only response tokens contribute to the loss.
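Building such a labels sequence for one example can be sketched in plain Python. The token ids are toy values and `build_example` is a hypothetical helper, not a library function. Labels stay aligned with the input ids here because Hugging Face causal language models shift them by one position internally.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_example(prompt_ids, response_ids, max_len, pad_id):
    """Concatenate prompt and response, masking prompt and padding in labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)

    # Truncate on the right, then pad up to max_len.
    input_ids, labels = input_ids[:max_len], labels[:max_len]
    attention_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)

    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }


example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], max_len=8, pad_id=0)
```

In `example["labels"]`, only the two response ids remain; the three prompt positions and the three padding positions are all -100.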

SFT in the post-training pipeline

SFT is stage one of a three-stage pipeline in production systems like InstructGPT:

SFT on human demonstrations.
Reward model training on human preference rankings between model outputs.
RLHF / PPO to optimise the SFT model against the reward model.

The SFT checkpoint serves two roles: it is the initial policy for RL fine-tuning, and it is the reference model against which KL-divergence penalties are applied during PPO. If the SFT checkpoint is poor, the RL stage cannot compensate; it can shift the distribution but cannot invent capabilities the base model never learned.
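The reference-model role can be made concrete with the KL-penalised reward used in InstructGPT-style RL: the reward model's score minus a coefficient times the log-ratio between the policy and the SFT reference. The sketch below works on per-token log-probabilities of one sampled response; the function name, the numbers, and the value of beta are illustrative assumptions.

```python
def kl_penalised_reward(reward, policy_logprobs, ref_logprobs, beta):
    """Reward-model score minus beta times the summed per-token log-ratio.

    The summed log-ratio on a sampled response is a single-sample estimate
    of the KL divergence between the policy and the SFT reference.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * log_ratio


# Policy identical to the SFT reference: no penalty.
same = kl_penalised_reward(1.0, [-1.0, -2.0], [-1.0, -2.0], beta=0.1)

# Policy has drifted towards this response: the reward is reduced.
drifted = kl_penalised_reward(1.0, [-0.5, -1.0], [-1.0, -2.0], beta=0.1)
```

The further the policy moves from the SFT checkpoint on a response, the more of its reward is taken back.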

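More recent methods such as DPO and SimPO collapse the reward modelling and RL stages into a single offline training step, but they still typically start from an SFT checkpoint. ORPO goes a step further and folds the supervised objective into the same step rather than running SFT separately. Either way, the quality of the supervised signal bounds what alignment can achieve.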
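For DPO specifically, the SFT checkpoint appears in the loss as the frozen reference model. A minimal sketch of the per-pair DPO loss, taking summed sequence log-probabilities as plain floats (the argument names and example values are assumptions for illustration):

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the policy or under the frozen reference (SFT) model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# At initialisation the policy equals the reference, so the margin is zero.
initial = dpo_loss(-5.0, -6.0, -5.0, -6.0)

# After the policy shifts probability towards the chosen response, the loss drops.
improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)
```

At initialisation the loss equals log 2. It falls only as the policy raises the chosen response relative to the rejected one, measured against the reference.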

When it falls down

Spurious format learning. The model can learn to produce text that looks like a good response without understanding the task. If every training response to a classification instruction begins "The sentiment is:", the model will produce that prefix even when the input is ambiguous or malformed.
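A quick way to spot this risk before training is to count how often responses open with the same words. The sketch below is a heuristic only; the three-word window and the 50% threshold are arbitrary choices.

```python
from collections import Counter


def common_prefixes(responses, n_words=3, min_share=0.5):
    """Return (prefix, share) pairs for opening phrases used in >= min_share of responses."""
    prefixes = Counter(" ".join(r.split()[:n_words]) for r in responses if r.strip())
    total = sum(prefixes.values())
    return [(p, c / total) for p, c in prefixes.most_common() if c / total >= min_share]


responses = [
    "The sentiment is: positive",
    "The sentiment is: negative",
    "The sentiment is: positive",
    "Mixed; the review praises the plot but not the acting.",
]
flagged = common_prefixes(responses)
```

Here three of four responses share the opening "The sentiment is:", so that prefix is flagged as a candidate for rewording or diversification.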

Inherited flaws from distilled data. Models trained on outputs from GPT-4 or Claude inherit not just capabilities but also failure modes: confident hallucinations, refusals calibrated to a different safety policy, and stylistic tics that are difficult to fine-tune away.

Catastrophic forgetting. Aggressive learning rates or too many epochs can erode capabilities from pretraining. A model fine-tuned for customer support might degrade on mathematical reasoning because that distribution is underrepresented in the instruction data.

Distribution shift at inference. SFT models are brittle to prompt formats they have not seen. A model trained exclusively on the Alpaca template (### Instruction / ### Response) will often behave erratically when a user sends a raw, unformatted message in a chat interface.
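One common mitigation is to wrap raw user messages in the training template before they reach the model. The sketch below reproduces the simplified template shown earlier in this article; released templates such as Alpaca's differ in detail, so the exact strings should be copied from whatever was used in training. `format_prompt` is a hypothetical helper.

```python
def format_prompt(instruction, input_text=""):
    """Render a message in the article's Instruction / Input / Response template."""
    parts = ["### Instruction", instruction.strip(), ""]
    if input_text.strip():
        parts += ["### Input", input_text.strip(), ""]
    parts += ["### Response", ""]  # the model generates from here
    return "\n".join(parts)


prompt = format_prompt(
    "Summarise the following in three bullet points.",
    "<document text>",
)

# A raw chat message with no separate input still gets the training format.
chat_prompt = format_prompt("What is the capital of France?")
```

Using one function for both training data and inference keeps the two formats from drifting apart.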

No preference information. SFT trains on demonstrations, not preferences. The model cannot distinguish between a response that is acceptable and one that is excellent, because both are treated as equally correct targets. Preference-based methods (RLHF, DPO) exist precisely to address this ceiling.
