Pretraining Objectives

Why the loss a model is trained to minimise decides what it can become, and how next-token prediction beat masked and span objectives to own the generative era.

Before a model is helpful, honest, or aligned, it is a function fit to a single loss over raw text. That loss, the pretraining objective, is the most consequential design choice in the whole pipeline; it decides what signal the model extracts from a trillion tokens and, downstream, what the model can be made to do at all. The reason almost every generative system today is decoder-only is not architectural taste. It is that one objective, next-token prediction, turned out to be both a strong learning signal and the thing you actually want at inference.

Three families of objective

Pretraining objectives differ in what they hide from the model and ask it to reconstruct.

| Objective | Hides | Reads context | Canonical model |
|---|---|---|---|
| Autoregressive (next-token) | the next token | left only (causal) | GPT family |
| Masked (MLM) | ~15% of tokens at random | both directions | BERT |
| Span corruption | contiguous spans, replaced by sentinels | both (encoder) + generate (decoder) | T5 |

Autoregressive training factorises the probability of a sequence as a product of conditionals, $p(x) = \prod_t p(x_t \mid x_{<t})$, and minimises the negative log-likelihood of each true next token. The model only ever sees leftward context, enforced by the causal mask (see transformer-architecture).
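To make the factorisation concrete, here is a minimal sketch in plain Python. The conditional probabilities are made-up numbers standing in for what a model would assign to each true token; the point is only that the joint probability is their product and the loss is the sum of their negative logs.

```python
import math

def sequence_nll(cond_probs):
    """Negative log-likelihood of a sequence from its per-step conditionals.

    cond_probs[t] is the probability assigned to the true token x_t given
    the true prefix x_<t. The values used below are hypothetical.
    """
    return -sum(math.log(p) for p in cond_probs)

# Assumed conditionals p(x_t | x_<t) for a 4-token sequence.
cond_probs = [0.5, 0.25, 0.8, 0.1]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(cond_probs)

# The training loss is the sum of per-token negative log-probabilities,
# which equals -log p(x). Dividing by length gives the per-token loss.
nll = sequence_nll(cond_probs)
per_token_loss = nll / len(cond_probs)

print(f"p(x)           = {joint:.4f}")
print(f"NLL            = {nll:.4f} nats")
print(f"per-token loss = {per_token_loss:.4f} nats")
```

Because the log turns the product into a sum, each token contributes its own loss term, which is why every position in a training sequence supplies a learning signal.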

Masked language modelling, BERT's objective, corrupts roughly 15% of tokens and asks the model to recover them using context from both sides. Bidirectionality makes the representations excellent for classification and embedding, but the model never learns to generate a sequence left to right, so it is awkward as a generator.
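The corruption step can be sketched in a few lines. This is a simplified illustration, not BERT's exact recipe: it replaces every selected position with a `[MASK]` string, whereas the original BERT procedure also sometimes substitutes a random token or leaves the selected token unchanged. The function name `mask_tokens`, the fixed seed, and the whitespace tokenisation are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (corrupted, labels) for a simplified MLM training example.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss would be computed on masked positions only.
    """
    rng = random.Random(seed)
    # Mask about mask_rate of the positions, and at least one.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return corrupted, labels

tokens = "the cat sat on the mat because it was warm and quiet there today".split()
corrupted, labels = mask_tokens(tokens)

print(" ".join(corrupted))
print([(i, tok) for i, tok in enumerate(labels) if tok is not None])
```

Note that the unmasked tokens on both sides of each `[MASK]` stay visible, which is what makes the objective bidirectional, and that only the masked positions carry a label.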

Span corruption, T5's objective, sits between the two: mask contiguous spans, replace each with a sentinel token, and have a decoder emit the missing spans. It reframes every task (translation, summarisation, classification) as text-to-text.
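A rough sketch of how one training pair might be built. The spans are passed in by hand rather than sampled, the `<extra_id_k>` sentinel spelling is an illustrative convention, and the closing sentinel that some implementations append to the target is omitted; `span_corrupt` is a hypothetical helper, not a library function.

```python
def span_corrupt(tokens, spans):
    """Build a (source, target) pair in the style of span corruption.

    spans is a list of non-overlapping (start, end) index pairs, end
    exclusive, sorted by start. Each span is replaced in the source by one
    sentinel; the target lists each sentinel followed by the tokens it hides.
    """
    source, target = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"  # sentinel spelling is an assumption
        source.extend(tokens[cursor:start])  # keep the text before the span
        source.append(sentinel)              # one sentinel replaces the span
        target.append(sentinel)
        target.extend(tokens[start:end])     # decoder must emit the hidden span
        cursor = end
    source.extend(tokens[cursor:])           # keep the text after the last span
    return source, target

tokens = "Thank you for inviting me to your party last week".split()
source, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])

print(" ".join(source))  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(target))  # <extra_id_0> for inviting <extra_id_1> last
```

The encoder reads the whole corrupted source in both directions, while the decoder generates the short target left to right, which is how the objective combines the two behaviours.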

Why next-token prediction won

The autoregressive objective has a property the others lack: the training task and the inference task are identical. You train the model to predict the next token, and at deployment you generate by predicting the next token. There is no gap to bridge. MLM, by contrast, trains on a corruption pattern (15% masking) that never occurs at generation time, so turning a BERT into a fluent generator means fighting the objective it was built on.
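The identity between training and inference can be seen in a toy setting. The sketch below uses a count-based bigram model as a stand-in for a neural network (an assumption made purely to keep the example small): the same `next_token_probs` function is what the loss scores during training and what generation calls at inference.

```python
import math
from collections import Counter, defaultdict

def fit_bigram(corpus):
    """Count next-token frequencies: a tiny stand-in for pretraining."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        toks = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(toks, toks[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """The model: a distribution over the next token given the previous one."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def training_loss(counts, sentence):
    """Training view: negative log-likelihood of each true next token."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return -sum(math.log(next_token_probs(counts, p)[n])
                for p, n in zip(toks, toks[1:]))

def generate(counts, max_len=10):
    """Inference view: repeatedly pick the most likely next token (greedy)."""
    out, prev = [], "<s>"
    for _ in range(max_len):
        probs = next_token_probs(counts, prev)
        prev = max(probs, key=probs.get)
        if prev == "</s>":
            break
        out.append(prev)
    return out

counts = fit_bigram(["the cat sat", "the cat ran", "the cat sat"])
loss = training_loss(counts, "the cat sat")
sample = generate(counts)

print(f"loss on 'the cat sat': {loss:.4f}")
print("greedy generation:", " ".join(sample))
```

Nothing has to be bolted on to make this model generate; the quantity the loss evaluates is exactly the quantity the generation loop consumes. A masked model offers no such loop for free, because its predictions are conditioned on tokens to the right that do not exist yet at generation time.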

Next-token prediction is also a deceptively complete task. To predict the next token well across the whole internet, a model is implicitly pushed to learn syntax, world facts, arithmetic, translation, and code structure, because all of those reduce uncertainty about what comes next. The objective is narrow; the competence it forces is broad. That is the engine behind in-context learning and, ultimately, behind every instruction-tuned assistant: the base model learned a usable representation of language and the world purely from compression pressure.

This does not make the other objectives obsolete. Encoder models trained with MLM still produce strong embeddings for retrieval (see embeddings-semantic-search), and encoder-decoder span-corruption models remain competitive on tasks with a clear input-to-output mapping. But for an open-ended generative assistant, autoregression is the objective whose train and test distributions match.

From objective to assistant

A base model fresh off pretraining is a next-token predictor, not an assistant; ask it a question and it may continue with more questions, because that is a plausible continuation of the text. Turning it into something helpful is a separate stage: supervised fine-tuning on demonstrations, then preference optimisation (see rlhf and dpo-preference-optimisation). None of that alignment work creates capability; it surfaces and directs capability the pretraining objective already instilled. This is why the base-model objective matters so much: you cannot fine-tune in a skill the pretraining loss never rewarded.

When it falls down

  • Objective-task mismatch. Picking MLM because bidirectionality sounds strictly better, then needing a generator, is a costly error. Match the objective to what the model must do at inference, not to whichever sounds more powerful.
  • Exposure bias. Autoregressive models train on ground-truth prefixes but generate on their own (possibly wrong) prefixes, so errors can compound across a long generation. This is part of why decoding strategy matters (see sampling-decoding).
  • The objective is a ceiling on alignment. A capability absent from the pretraining signal cannot be reliably fine-tuned in later; preference optimisation redistributes probability mass, it does not teach new skills.
  • Data quality is silent. The same objective on web sludge versus curated tokens yields very different models. The loss does not care what it compresses, so the corpus does the quiet work.
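The exposure-bias point above can be illustrated with a deliberately exaggerated toy. The "model" here is a hypothetical hand-written rule with a single flaw, not a trained network, and real models often recover from a bad token rather than drifting forever; the sketch only shows why one error costs more when the model conditions on its own output.

```python
def toy_model(prev):
    """Hypothetical next-token rule: count upward, with one flaw at 3."""
    if prev == 3:
        return 5  # the single mistake: the true next token is 4
    return prev + 1

truth = list(range(10))  # ground-truth sequence 0, 1, ..., 9

# Teacher forcing (training): each prediction sees the TRUE previous token.
teacher_forced = [toy_model(truth[t - 1]) for t in range(1, len(truth))]

# Free running (generation): each prediction sees the model's OWN last output.
free_running = []
prev = truth[0]
for _ in range(1, len(truth)):
    prev = toy_model(prev)
    free_running.append(prev)

# Count positions where the prediction differs from the true next token.
tf_errors = sum(p != y for p, y in zip(teacher_forced, truth[1:]))
fr_errors = sum(p != y for p, y in zip(free_running, truth[1:]))

print("teacher-forced errors:", tf_errors)  # 1
print("free-running errors:  ", fr_errors)  # 6
```

Under teacher forcing the flaw costs one position, because the next step is handed the correct prefix again. In free-running generation the same flaw shifts every later token, which is the compounding the training loss never observes.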
