Linear Algebra for ML

Almost every operation a neural network performs is a matrix multiplication. The reason LoRA fine-tunes a 70B model with a few million parameters, the reason PCA compresses embeddings, the reason int8 quantisation barely hurts accuracy - all of it falls out of one fact: the matrices that show up in deep learning have most of their useful signal in a small number of singular directions. Linear algebra is what tells you which directions matter.

Vector spaces and linear maps

A vector space over the reals is a set closed under addition and scalar multiplication. The dimension is the size of a basis. Two facts dominate practice:

A linear map between vector spaces is fully described by its action on a basis. Once you fix bases on both sides, the map is a matrix.
Composing linear maps corresponds to multiplying their matrices. (A B) x = A (B x). Matmul is not arbitrary algebra; it is function composition.

This is why a stack of n linear layers with no non-linearity in between collapses to a single linear layer: composition of linear maps is linear. The non-linearity is the only thing buying you expressive depth.

Rank, null space, and what matrices actually do

The rank of a matrix is the dimension of its column space - the set of outputs it can produce. The null space is the set of inputs it sends to zero. Rank + nullity = number of columns.

In ML these are not abstractions:

A weight matrix of nominal shape (d, d) but effective rank r is doing a (d, r) * (r, d) factorisation in disguise. You are paying d^2 parameters for r * d worth of structure.
The Jacobian of a layer with rank deficiency has a null space - directions of input change the layer is blind to. Adversarial examples often live in or near these directions.

Eigendecomposition

If A is square and has d linearly independent eigenvectors, you can write:

A = V Lambda V^{-1}

Lambda is diagonal with eigenvalues; V has eigenvectors as columns. Applying A becomes "rotate into the eigenbasis, scale each axis by its eigenvalue, rotate back." Spectral radius (largest absolute eigenvalue) controls whether iterating A blows up or shrinks - the same fact that drives vanishing and exploding gradients in RNNs.

Eigendecomposition only exists for square matrices, and even then it can be numerically unstable when eigenvectors are nearly parallel. For real ML work you usually want SVD instead.

SVD - the workhorse

Every real matrix A of shape (m, n) factorises as:

A = U S V^T

U is (m, m) orthonormal - left singular vectors.
V is (n, n) orthonormal - right singular vectors.
S is (m, n) diagonal with non-negative singular values, sorted descending.

The singular values measure how much A stretches each orthogonal direction. The rank of A is the count of non-zero singular values. Truncating S to its top r entries gives the best rank-r approximation in Frobenius norm (Eckart-Young theorem). This single result is the mathematical backbone of:

Technique	What SVD says
PCA	Centre the data, take SVD; right singular vectors are principal axes
LoRA	Replace weight update `dW` with `B A` where `B`, `A` are `(d, r)` and `(r, d)`
Low-rank attention	Approximate softmax(QK^T) by a rank-`r` factorisation
Quantisation calibration	Singular spectrum tells you how much precision each direction needs
Model pruning	Drop the smallest singular components first

Why ML uses low-rank structure

Empirically, the update matrices that emerge during fine-tuning a large pretrained model have fast-decaying singular spectra. Most of the change in W lives in a tiny subspace. Hu et al's LoRA paper showed that rank 8 is often enough to recover full fine-tuning quality on downstream tasks. That is why a single H100 can fine-tune a 70B model: you train B A (a few hundred MB) and freeze the original 140 GB of weights.

The same low-rank prior makes int4 / int8 quantisation work. The directions that carry signal need precision; the rank-deficient directions can be aggressively rounded with no measurable loss.

The cost of matmul on real hardware

A naive (m, k) * (k, n) matmul is 2 m k n floating-point ops. On an H100 in BF16 you get ~990 TFLOPs of dense matmul throughput. The bottleneck is rarely FLOPs - it is memory bandwidth. Modern GPUs are arithmetic-intensity machines: they need many FLOPs per byte loaded from HBM. This is why:

Big matmuls dominate over small ones. Fusing many small multiplies into one large one is the single biggest performance lever in attention kernels.
Quantisation helps even when arithmetic is the same precision: less data to move per byte of compute.
Activation memory, not weight memory, is what limits batch sizes during training. Each forward pass produces gigabytes of intermediates that the backward pass needs to read back.

Common pitfalls

Eigendecomposition vs SVD on non-symmetric matrices. Eigenvectors can be complex, non-orthogonal, or degenerate. SVD always exists and is always orthonormal. Reach for SVD unless you have a symmetric positive-definite matrix.
Condition number. Ratio of largest to smallest singular value. A matrix with condition number 1e8 will lose 8 digits of precision when you invert or least-squares it. Many "my model NaNed during a tiny perturbation" bugs trace to ill-conditioned linear systems.
Forgetting the broadcasting cost. A @ x where A is (10000, 10000) and x is (10000,) is 100M ops but the load cost of A dominates. The same op on a batch of 1000 vectors is 1000x the FLOPs but only ~1x the memory traffic for A - dramatically cheaper per vector.