Chat Templates and Special Tokens

A base language model will cheerfully continue any token sequence. An instruction-tuned model is different: during supervised fine-tuning (SFT) it learned to expect specific delimiter tokens that announce role boundaries - who is the user, where the assistant reply begins, where the turn ends. Get those tokens wrong at inference time and the model's behaviour degrades silently. No exception is thrown; the model just answers as if slightly drunk.

This is the core problem chat templates solve. They are not cosmetic. They are load-bearing infrastructure.

What a chat template actually is

A chat template is a Jinja2 template string stored directly in the tokeniser's chat_template attribute (and serialised alongside it in tokenizer_config.json). When you call tokenizer.apply_chat_template(messages), it renders that template against your conversation list and returns a single string (or token IDs) that is ready to pass to the model.

Two models fine-tuned from the same Mistral-7B base weight can require completely different formats:

Model	User turn format	EOS strategy
Mistral-7B-Instruct-v0.1	`<s>[INST] … [/INST]`	`</s>` after assistant turn
Zephyr-7B-beta	`<\\|user\\|>\n…</s>\n<\\|assistant\\|>`	`</s>` inline after each speaker
Llama-3-8B-Instruct	`<\\|start_header_id\\|>user<\\|end_header_id\\|>\n\n…<\\|eot_id\\|>`	`<\\|eot_id\\|>` at every turn end

These are not stylistic choices. They reflect the exact formatting of the SFT training data for each model. Feeding Llama-3 format into Mistral-Instruct produces coherent text that ignores instructions.

A minimal Jinja2 template looks like this:

{%- for message in messages %}
    {{- '<|' + message['role'] + '|>\n' }}
    {{- message['content'] + eos_token }}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|assistant|>\n' }}
{%- endif %}

The - after {% and {{ strips whitespace that would otherwise appear in the rendered string. Extra whitespace that was not present during training measurably hurts performance because the model has never seen those sequences.

The taxonomy of special tokens

Special tokens are reserved vocabulary entries that carry structural rather than semantic meaning. They are added to the tokeniser vocabulary explicitly rather than emerging from BPE or Unigram merges.

Universal control tokens appear in almost every model:

bos_token (<s>, <|begin_of_text|>) - marks the start of a sequence; the model almost always receives one at position 0.
eos_token (</s>, <|end_of_text|>) - signals that generation should stop; the decoding loop checks for this.
pad_token - fills shorter sequences during batched training to make all sequences the same length; usually masked out in the loss.
unk_token - represents out-of-vocabulary characters; rare in modern byte-level BPE vocabularies.

Chat-specific control tokens vary by model family. Some are added as new vocabulary entries; others are ASCII strings that happen to be rare enough that the tokeniser encodes them as single tokens after deliberate design:

Llama 2 used printable strings [INST], [/INST], <<SYS>> that were injected as text but were unlikely to appear in the pre-training corpus.
Llama 3 and ChatML (<|im_start|>, <|im_end|>) added dedicated token IDs with no pre-training history, making the semantic boundary unambiguous.
The <|eot_id|> (end-of-turn) token in Llama 3 serves a finer-grained purpose than EOS: it ends a speaker's turn without necessarily stopping generation.

The important distinction is between tokens that existed in the base model's vocabulary (which carry pre-training associations) and tokens that were freshly initialised during chat fine-tuning (which have no prior meaning and must learn theirs entirely from SFT data). Fresh tokens require sufficient SFT examples to converge; too few examples and the token embedding stays near its random initialisation.

The add_generation_prompt flag

Because the model is a next-token predictor, format determines behaviour directly. If you submit a conversation that ends after the last user turn, the model faces an ambiguous context: should it continue the user's message, or start replying?

The add_generation_prompt=True flag appends the opening tokens of an assistant turn (e.g. <|start_header_id|>assistant<|end_header_id|>\n\n) without closing them. This primes the model to generate in assistant mode. The flag is set automatically by TextGenerationPipeline.

During training, add_generation_prompt must be False (and is by default when you call apply_chat_template for dataset preprocessing). Adding the assistant header during training would teach the model to generate those tokens as part of its response content.

A complementary parameter, continue_final_message=True, suppresses the closing EOS of the last turn so generation can extend an existing partial reply - useful for structured output prefilling or chain-of-thought steering.

Why standardisation failed and what emerged

Every major lab had already committed to a format before anyone proposed a common standard. Llama 2's [INST] format, GPT-4's ChatML (<|im_start|>), Alpaca's ### Instruction: header, and Vicuna's USER: prefix all existed before the apply_chat_template API was introduced. Retrofitting a single format would have broken every existing fine-tuned derivative.

ChatML (<|im_start|>{role}\n{content}<|im_end|>) has emerged as the de facto new-model default because it cleanly separates role metadata from content and is used by Qwen, Phi-3, and several other recent families. But it is a convention, not a standard; you cannot assume it for an arbitrary model.

The practical consequence: always load the tokeniser from the model card and call apply_chat_template. Never hardcode a format string.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user",   "content": "What is backpropagation?"},
]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

This is immune to format drift: if Meta releases a Llama-3.1 revision with a different template, the tokeniser update propagates the fix automatically.

When it falls down

Template-training mismatch on community fine-tunes. Many community SFT datasets (Alpaca, ShareGPT derivatives) were formatted without apply_chat_template and may use a different format than the base tokeniser's template. Fine-tuning on such data and then serving with apply_chat_template at inference creates a mismatch. The model's loss during SFT never saw the "correct" special tokens.

Duplicate special tokens. If you call apply_chat_template(tokenize=False) and then tokenise the resulting string with add_special_tokens=True, the tokeniser may prepend a second bos_token. This doubles a token the model only saw once during training and subtly corrupts the position encodings from position 0.

Loss masking errors during SFT. During supervised fine-tuning, the convention is to compute loss only on assistant tokens (masking user and system turns to -100). If the template applies incorrectly - for instance, missing the generation-prompt header - the loss mask may accidentally supervise the model to reproduce user messages, causing instruction-following collapse.

Cross-language template portability. Jinja2 templates that call Python methods (.lower(), .items()) break in JavaScript and Rust inference runtimes. The HuggingFace Transformers documentation recommends replacing Python method calls with Jinja filters (|lower, |dictitems) to keep templates portable across tokenisation backends like tokenizers (Rust) and llama.cpp.

Freshly initialised special tokens in extended fine-tuning. When you add new special tokens (for tool-call roles, reasoning blocks, etc.) and fine-tune further, those tokens start with random embeddings. If the new SFT corpus is small relative to the vocabulary update, the token embedding may not converge, producing unstable outputs near those tokens. Monitoring the embedding norm for new special tokens during training is a useful diagnostic.

System prompt injection. Because the system turn is injected at position 0 before user input, a user-supplied string that contains the literal system-turn delimiter can escape the user turn and rewrite system instructions. Sanitising input against known delimiter tokens is necessary for any production deployment that constructs the system prompt partly from user data.

What a chat template actually is

The taxonomy of special tokens

The add_generation_prompt flag

Why standardisation failed and what emerged

When it falls down

Further reading