Concept library
Learn the stack, end-to-end.
Structured paths from Python through transformers, RAG, and agents. Free concepts unlock more as you progress; premium concepts require a Pro subscription.
Foundations 3 concepts
The vocabulary every practitioner needs before going deeper: what a language model is, what tokens are, and how transformers learn.
-
Pro
RLHF: Reward Modeling, PPO, and the DPO Trade-off
How human preference data becomes a reward model, how PPO uses it to fine-tune an LLM, and why DPO often replaces the whole pipeline.
intermediate · 3 min
-
What is a Large Language Model?
A grounded definition: an LLM is a neural network trained to predict the next token. Everything else — reasoning, tools, agents — is built on top.
beginner · 5 min
-
Tokens & Tokenization
Models don't see words — they see tokens. Understanding BPE/SentencePiece changes how you reason about context windows, cost, and weird edge cases.
beginner · 6 min
NLP Foundations 4 concepts
Attention, tokenisation, embeddings, and the building blocks that every modern LLM still relies on.
-
Attention Mechanism
How attention lets a model focus on the relevant parts of a sequence by computing weighted dependencies between every pair of positions.
intermediate · 7 min
-
Embeddings and Semantic Search
How dense vectors turn text into a geometry of meaning, and how cosine similarity lets you find related content without keywords.
beginner · 7 min
-
Tokenisation and BPE
How models turn raw text into integer tokens, and why the vocabulary choice silently shapes model behaviour.
beginner · 6 min
-
Transformer Architecture
The encoder-decoder stack that replaced recurrence, and whose decoder-only descendants power most modern LLMs.
intermediate · 8 min
Mathematical Foundations 6 concepts
Linear algebra, probability, calculus, optimisation, and the numerical-computation reality that ML engineers actually hit.
-
Calculus and Gradients
Partial derivatives, the chain rule as the engine of backprop, why second-order methods are rare in deep learning, and what gradient clipping actually does.
intermediate · 8 min
-
Linear Algebra for ML
Vector spaces, matrix decompositions, and why low-rank structure underlies LoRA, PCA, and the quantisation tricks that make modern LLMs cheap to serve.
intermediate · 9 min
-
Numerical Computation Gotchas
Catastrophic cancellation, the log-sum-exp trick, mixed-precision training, the determinism tax, and how to actually debug a NaN in a 70B model.
intermediate · 8 min
-
Pro
Optimisation Theory
Convexity, why SGD finds good solutions on non-convex losses, saddle points at scale, momentum as a damped oscillator, and learning-rate schedules as implicit regularisation.
advanced · 10 min
-
Probability and Information Theory
Distributions, expectations, entropy, KL divergence, and why softmax + cross-entropy is the canonical pair that secretly underlies almost every LLM loss.
intermediate · 9 min
-
Pro
Statistical Learning Theory Primer
Bias-variance, PAC-learning, VC dimension, why deep nets break classical generalisation bounds, double descent, and what scaling laws are actually saying.
advanced · 10 min
Deep Learning 6 concepts
CNNs, recurrent networks, normalisation, regularisation, optimisers, and automatic differentiation.
-
Backpropagation and Automatic Differentiation
How reverse-mode autodiff turns the chain rule into an efficient gradient algorithm, and the design choices PyTorch and JAX make to implement it at scale.
intermediate · 8 min
-
Convolutional Neural Networks
Why weight sharing and local receptive fields make CNNs the right inductive bias for images, and where ViTs took over.
intermediate · 8 min
-
Dropout and Modern Regularisation
Why dropout was the dominant regulariser for a decade and why modern LLM training mostly skips it in favour of letting data do the work.
beginner · 6 min
-
Pro
Normalisation: BatchNorm, LayerNorm, RMSNorm
Why normalisation accelerates training, why transformers use LayerNorm instead of BatchNorm, and why RMSNorm is now the default in Llama-class models.
intermediate · 7 min
-
Pro
Optimisers: SGD, Adam, AdamW, Lion
How the standard optimiser stack evolved from plain SGD through Adam to memory-cheaper variants like Lion and Muon, and which learning-rate schedules actually work at scale.
intermediate · 9 min
-
Recurrent Networks: RNN, LSTM, GRU
How gating mitigated the vanishing-gradient problem in RNNs, and why transformers displaced them everywhere except streaming and on-device workloads.
intermediate · 8 min
Architectures & Scaling 2 concepts
Inside the model: attention, training objectives, parameter scaling, and the architectural choices that shape what an LLM can do.
-
The Attention Mechanism
Attention lets every token decide which other tokens to look at. It's the core operation that made transformers replace RNNs.
intermediate · 8 min
-
Pretraining vs Instruction Tuning
How a base model becomes a chatbot. Three stages — pretraining, supervised fine-tuning, RLHF — each doing something specific.
intermediate · 9 min
Large Language Models 5 concepts
Pretraining, fine-tuning vs RAG, RLHF/DPO, chain-of-thought, mixture-of-experts - the LLM craft.
-
Chain of Thought Prompting
Why telling the model to think step by step radically improves reasoning, and when it actively hurts.
beginner · 5 min
-
Pro
Fine-tuning vs RAG
When to teach the model new behaviour vs when to retrieve fresh context at runtime.
intermediate · 8 min
-
Pro
Mixture of Experts
Why MoE models can be 10x cheaper to serve than dense models of the same capability, and what makes them hard to train.
advanced · 8 min
-
Pro
Reinforcement Learning from Human Feedback
How preference data and PPO turn a pretrained language model into a helpful, honest, harmless assistant.
advanced · 10 min
-
Retrieval Augmented Generation
The end-to-end RAG pipeline from chunking through retrieval, reranking, and grounded generation.
intermediate · 9 min
Inference Optimisation 6 concepts
KV cache, FlashAttention, speculative decoding, quantisation, vLLM, MoE serving - the production-grade inference stack.
-
Pro
FlashAttention
An IO-aware attention kernel that is both faster and lower-memory than the textbook implementation by tiling computation to keep activations in SRAM.
advanced · 9 min
-
KV Cache
Why each decoding step costs quadratic attention work without a KV cache and linear work with one, and why managing that cache is now the dominant memory problem in LLM serving.
intermediate · 8 min
-
Pro
Mixture-of-Experts Inference
Why serving MoE models is harder than serving dense models of equivalent quality, and how DeepSeek and Mistral made it work in production.
advanced · 9 min
-
Quantisation - INT8, INT4, FP8
How to cut weight and activation precision below 16 bits without wrecking quality, and which scheme to pick for which deployment.
intermediate · 9 min
-
Pro
Speculative Decoding
Use a small draft model to propose tokens that a large verifier accepts or rejects in parallel, giving lossless 2-3x latency wins on autoregressive generation.
advanced · 9 min
-
vLLM and Continuous Batching
Why static batching wastes most of your GPU on variable-length workloads, and how iteration-level scheduling combined with PagedAttention raises throughput by an order of magnitude.
intermediate · 8 min
Training Infrastructure 5 concepts
Data, tensor and pipeline parallelism, ZeRO/FSDP, mixed precision, gradient checkpointing, and the engineering of large-scale training.
-
Data Parallelism and DDP
How replicating the model and sharding the batch across GPUs scales training, and why AllReduce is the primitive every framework eventually depends on.
intermediate · 8 min
-
Pro
Gradient Checkpointing, Activation Recomputation, and CPU Offload
Why activations - not weights - usually dominate training memory, and how recomputation and CPU/NVMe offload trade compute and bandwidth to fit larger models.
intermediate · 9 min
-
Mixed-Precision Training (FP16, BF16, FP8)
How lower-precision formats halve memory and double throughput on tensor cores, why BF16 displaced FP16 for training, and what FP8 changes on H100 and Blackwell.
intermediate · 8 min
-
Pro
Tensor and Pipeline Parallelism
How frontier labs split a model across thousands of GPUs by sharding within layers (tensor parallel) and across layers (pipeline parallel), and how to pick the split.
advanced · 10 min
-
Pro
ZeRO and FSDP
How sharding optimiser state, gradients, and parameters across data-parallel ranks turns a memory problem into a bandwidth problem, and why FSDP is now the standard sharding approach in PyTorch.
advanced · 9 min
Reasoning Models 6 concepts
Test-time compute, process reward models, CoT/self-consistency/search, the o-series and R1, and how to evaluate reasoning fairly.
-
Chain of Thought, Self-consistency, and Search at Inference
A tour of the inference-time reasoning toolkit - from zero-shot CoT prompts to MCTS-decoded reasoning trees, and when each pays for itself.
intermediate · 8 min
-
Pro
DeepSeek-R1 and the Open Reasoning Recipe
How DeepSeek's R1 pipeline produced o1-class reasoning with an open paper, an open model, and a recipe other labs could replicate within weeks.
advanced · 9 min
-
Pro
OpenAI o1, o3 and the Reasoning-Model Family
What is publicly known and what is speculated about OpenAI's reasoning line, why the chain of thought is hidden, and what o3's ARC-AGI result actually proved.
intermediate · 8 min
-
Pro
Process Reward Models and Verifiable Rewards
Why scoring every step of a reasoning trace beats scoring only the final answer, and how Ai2 and DeepSeek replaced PRMs entirely with programmatic correctness checks.
advanced · 9 min
-
Reasoning Evals and the Contamination Problem
A guided tour of the reasoning benchmark canon, why each saturated faster than the field expected, and the move to live held-out evals as the contamination crisis bites.
intermediate · 7 min
-
Test-time Compute Scaling
Why "thinking longer" at inference can substitute for "training bigger", how the trade-off is operationalised as a token budget, and where the strategy stops paying.
intermediate · 7 min
Applied LLMs 1 concept
Putting models to work: prompting, retrieval-augmented generation, fine-tuning trade-offs, and agentic systems.
Agents & Tool Use 2 concepts
Function calling, ReAct, multi-step planning, and the workflow-vs-agent boundary that decides whether your design ships.
Evaluation & MLOps 6 concepts
Public benchmarks, HELM, custom evals, LLM-as-judge, red-teaming, model registries, and production monitoring.
-
Pro
Custom Evals and LLM-as-Judge
Why every production team eventually builds its own eval set, and how to use LLM judges without being fooled by their well-documented biases.
intermediate · 9 min
-
HELM and Holistic Evaluation
Why a single accuracy number is gameable, and how Stanford's HELM, BIG-bench, and lm-evaluation-harness push evaluation toward a multi-axis picture.
intermediate · 7 min
-
Model Registry, Lineage, and Reproducibility
The infrastructure that answers "which dataset, code, and hyperparameters produced this checkpoint?" - and why you only miss it the first time you cannot reproduce a model.
intermediate · 7 min
-
Pro
Production Monitoring and Drift Detection
How to catch silent regressions in deployed LLMs by monitoring input drift, output quality, and per-user randomised experiments before users tell you something is broken.
intermediate · 9 min
-
Public Benchmarks - MMLU, GPQA, HumanEval, MATH
A tour of the academic benchmarks that anchor frontier model launches, and why most of them are saturating, contaminated, or both.
intermediate · 8 min
-
Pro
Red-Teaming and Adversarial Evaluation
Why benign benchmark scores do not predict how a deployed model behaves under attack, and the human and automated methods used to find the failures first.
advanced · 9 min
Safety & Alignment 6 concepts
Prompt injection, jailbreaks, Constitutional AI, sycophancy, mechanistic interpretability, and frontier-alignment evaluations.
-
Alignment Evaluations and Frontier-Model Risk
How frontier labs and governments measure dangerous capabilities, what an eval-gated release looks like, and where the regulatory regime sits in 2026.
intermediate · 8 min
-
Pro
Constitutional AI and RLAIF
How Anthropic replaced human harmlessness labels with a written constitution and a critique-and-revise loop, and why this makes alignment auditable.
advanced · 9 min
-
Pro
Jailbreaks and Refusal Robustness
How attackers reliably bypass model refusal training, why post-hoc filters are necessary but never sufficient, and how AILuminate measures what remains.
intermediate · 9 min
-
Pro
Mechanistic Interpretability Primer
How sparse autoencoders extract human-interpretable features from model activations, what circuit-level analysis buys you for safety, and where the science is still contested.
advanced · 10 min
-
Prompt Injection
Why LLMs cannot reliably tell instructions from data, how indirect injection weaponises retrieved content, and which partial defences are worth deploying.
intermediate · 8 min
-
Pro
Sycophancy, Deception, and Reward Hacking
Why preference-trained models learn to please rather than to be right, what alignment faking is, and why evaluating during training can mislead you.
advanced · 9 min
LLM Systems 6 concepts
Vector databases, hybrid retrieval, prompt-cache infrastructure, gateways and routing, token accounting, and multi-tenant serving.
-
Hybrid Retrieval - BM25 + Vector + Reranking
Why pure vector search misses exact-match queries, how RRF combines lexical and semantic results, and where a cross-encoder reranker buys back the precision you lost.
intermediate · 8 min
-
Pro
LLM Gateways and Routing
Why every serious LLM deployment ends up behind a gateway, and how to choose between LiteLLM, Portkey, OpenRouter, and rolling your own.
intermediate · 9 min
-
Pro
Multi-Tenant Serving and Isolation
Serving many tenants from one model is cheap and easy; giving each tenant their own fine-tune is expensive and hard. S-LoRA and per-request LoRA serving collapse the trade-off, but only for tenants who can share a base model.
advanced · 10 min
-
Pro
Prompt Caching Infrastructure
How Anthropic, OpenAI, and vLLM let you reuse the KV cache of repeated prefixes, what the cache key actually is, and the patterns that turn cache hit rate into a real bill reduction.
intermediate · 9 min
-
Token Accounting, Billing, and Quotas
Why a single token counter is not enough, how to attribute spend across users and features without losing your mind, and the patterns that prevent one bad actor from spending the whole month's budget on a Tuesday afternoon.
intermediate · 8 min
-
Vector Databases Compared - pgvector, Qdrant, Milvus, Weaviate, LanceDB
A practitioner's guide to picking a vector store, weighing index trade-offs against the operational cost of running yet another database alongside your primary store.
intermediate · 9 min
Vision & Multimodal 6 concepts
Vision transformers, CLIP-style contrastive models, diffusion, segmentation, and the multimodal LLM stack that fuses text, image, audio, and video.
-
Contrastive Vision-Language: CLIP
How contrastive training on 400M image-text pairs produced a shared embedding space that does zero-shot classification, retrieval, and grounding without any task-specific labels.
intermediate · 8 min
-
Pro
Diffusion Models
How learning to invert a noise process became the dominant generative recipe for images, video, and audio, and why Flow Matching and DiTs are reshaping the recipe in 2024.
advanced · 10 min
-
Pro
Multimodal LLMs: LLaVA, Flamingo, GPT-4V
The vision-encoder-plus-projector-plus-LLM recipe that dominates open multimodal models, why Flamingo's perceiver design still matters for video, and what native-multimodal frontier models do differently.
advanced · 9 min
-
Pro
Segment Anything (SAM) and Dense Prediction
How promptable segmentation became a foundation-model task, what SAM's encoder-decoder split was designed for, and where it still loses to specialist models.
intermediate · 7 min
-
Video, Audio, and Any-to-Any Models
How Whisper, V-JEPA, Sora-class video generators, MusicGen, and unified any-to-any models extend the multimodal stack beyond static images.
intermediate · 9 min
-
Vision Transformers (ViT)
How treating an image as a sequence of patches let pure transformers beat CNNs once data crossed the 300M-image mark, and what the architecture gave up to get there.
intermediate · 8 min