Four Bits Per Weight: How Low-Precision Quantization Stopped Hurting LLMs

June 04, 2026 · 22 min read

In late 2022, if you tried to squeeze a 175-billion-parameter language model from 16 bits per weight down to 4, you got back a model that babbled. Round-to-nearest quantization, the obvious approach, raised perplexity on OPT-175B by 4.53 points, enough to make a 175B model perform worse than a full-precision model nearly fifty times smaller (Frantar et al., 2022, GPTQ, arXiv:2210.17323). The weights fit in a quarter of the memory, and the model was useless.

That gap closed almost as soon as it was measured. The same paper's method, GPTQ, lost only 0.14 perplexity at 4 bits on the same model. Today, 4-bit checkpoints are the default way most people run open-weight models on a single GPU, and hardware vendors have started building silicon that multiplies 4-bit numbers natively. The story of how rounding got from "catastrophic" to "free" is mostly a story about one stubborn phenomenon: a handful of weights and activations that refuse to be small.

Why this matters: A modern LLM spends most of its inference time not computing but waiting, idling while billions of weights stream from memory into the compute units. Quantization attacks that bottleneck directly by shrinking each weight. The catch is that naive shrinking destroys the model, and understanding why tells you most of what you need to know about how LLMs actually store information.

TL;DR

  • LLM inference is memory-bound, not compute-bound. The dominant cost of generating a token is reading every weight out of GPU memory, so halving the bits per weight roughly halves the time and the footprint.
  • Naive rounding holds up on smaller models, then collapses at scale. The cause is outlier features: a small number of activation dimensions with magnitudes 20 to 100 times the rest, which emerge around 6.7B parameters (where they first broke plain INT8 quantization) and appear in every layer of large models (Dettmers et al., 2022, arXiv:2208.07339).
  • The breakthrough methods do not quantize each weight in isolation. GPTQ minimizes the output error of each layer using second-order information; AWQ protects the ~1% of weights that matter most by scaling them before rounding.
  • Rotation-based methods (QuaRot, SpinQuant) take a different route: they mathematically spin the activations so the outliers spread out evenly, making everything easy to quantize at once.
  • 8-bit weights with 8-bit activations (W8A8) is now essentially lossless. 4-bit weights are near-lossless. 4-bit activations and below are still where quality starts to fray.
  • FP4 has moved from research to hardware. NVIDIA's Blackwell GPUs multiply 4-bit floats natively at roughly twice the throughput of 8-bit, which changes the calculus from "save memory" to "also save compute."

At a Glance

flowchart LR
  W["FP16 weights<br/>140 GB"]:::blue --> Q{"Quantize<br/>how?"}:::purple
  Q -->|"Round each weight"| RTN["RTN<br/>fast, breaks at scale"]:::rose
  Q -->|"Minimize layer output error"| GPTQ["GPTQ<br/>4-bit, near-lossless"]:::emerald
  Q -->|"Protect salient weights"| AWQ["AWQ<br/>4-bit, robust"]:::emerald
  Q -->|"Rotate away outliers"| ROT["QuaRot / SpinQuant<br/>4-bit W+A+KV"]:::emerald
  GPTQ --> OUT["35 GB checkpoint<br/>2-4x faster decode"]:::teal
  AWQ --> OUT
  ROT --> OUT
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff

Every modern quantization method is an answer to the same question: given that you must round billions of high-precision numbers to a coarse grid, how do you choose the rounding so the model's behavior survives? The naive answer (round each weight to its nearest grid point) is the one that fails. The rest of this article is about the better answers.

Before 4-Bit

Quantization is older than deep learning. Fixed-point arithmetic and integer DSP code have rounded floats to save space and power for decades. What changed with LLMs is the stakes: the models got so large that memory, not arithmetic, became the wall, and so large that the old rounding tricks stopped working.

timeline
  title Evolution of LLM quantization
  2017 : Transformer introduced : FP32 / FP16 training is the norm
  2019 : INT8 post-training quantization works well for CNNs and small Transformers
  2022 : LLM.int8 finds emergent outlier features at 6.7B
  2022 : GPTQ quantizes 175B models to 3-4 bit in ~4 GPU hours
  2022 : SmoothQuant migrates activation outliers into weights for W8A8
  2023 : AWQ protects salient weights; 4-bit becomes the community default
  2024 : QuaRot and SpinQuant use rotations for 4-bit weights, activations, and KV cache
  2025 : Blackwell ships native FP4 tensor cores; FP4 moves into hardware

The pivotal discovery came from trying to make 8-bit work and watching it fail. INT8 quantization had been a solved problem for vision models, so the expectation was that it would transfer cleanly to Transformers. It did, up to a point. Tim Dettmers and colleagues found that quantizing the feed-forward and attention projection matrices to INT8 worked beautifully until models crossed roughly 6.7B parameters, at which point accuracy fell off a cliff (Dettmers et al., 2022, arXiv:2208.07339).

[IMAGE: Heatmap of activation magnitudes across feature dimensions for a 6.7B model, with a few bright vertical stripes marking the outlier features that dwarf everything else]

The cause was not a bug. It was a property of how large Transformers represent information. Starting at a predictable scale, a small set of feature dimensions, often fewer than ten out of thousands, develop activation magnitudes vastly larger than everything around them. These outlier features are not noise; the model relies on them, and knocking them out by clumsy rounding degrades it sharply. Quantization, it turned out, was not really an arithmetic problem. It was a problem about which numbers the model cares about most.

How Quantization Actually Works

The grid

Quantization replaces a high-precision number with the nearest value on a coarse, evenly spaced (or float-spaced) grid, plus a scale factor that maps the grid back to the original range. For symmetric integer quantization of a weight tensor \(W\), you pick a scale \(s\) and store integers:

\[q = \text{round}\!\left(\frac{W}{s}\right), \qquad \hat{W} = s \cdot q\]

with \(s = \frac{\max(|W|)}{2^{b-1} - 1}\) for \(b\)-bit signed integers. With this symmetric scale, a 4-bit grid uses just 15 distinct levels, the integers \(-7\) through \(7\). The quantization error on any single weight is at most \(s/2\), but the error that matters is not per-weight; it is what happens to the layer's output once thousands of these errors accumulate through a matrix multiply.
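As a minimal sketch of the two formulas above, assuming NumPy and a synthetic Gaussian weight vector (the function names are illustrative, not from any library), the quantize and dequantize steps look roughly like this:

```python
import numpy as np

def quantize_symmetric(w, bits=4):
    """q = round(w / s) with s = max|w| / (2^(b-1) - 1)."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    s = np.abs(w).max() / qmax
    q = np.clip(np.round(w / s), -qmax, qmax).astype(np.int8)
    return q, s

def dequantize(q, s):
    """w_hat = s * q."""
    return s * q.astype(np.float32)

# Synthetic weights; the 0.02 standard deviation is an arbitrary assumption.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)

q, s = quantize_symmetric(w)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()

print("distinct levels used:", len(np.unique(q)))
print("max per-weight error:", max_err, "bound s/2:", s / 2)
```

The per-weight bound holds by construction; it says nothing yet about the error in the layer's output, which is the quantity the later sections care about.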

[IMAGE: A weight distribution histogram (roughly Gaussian) with a 4-bit and an 8-bit quantization grid overlaid, showing how few levels 4-bit offers and how the tails get clipped]

The choice of grid is itself a design axis. Per-tensor quantization uses one scale for an entire weight matrix; per-channel uses one scale per output channel; group-wise uses one scale per block of (say) 128 weights. Finer granularity costs a little storage for the extra scales but tracks local structure far better. Most production 4-bit formats are group-wise with a group size of 64 or 128.
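To make the granularity trade-off concrete, here is a small sketch that compares one scale per tensor against one scale per group of 128 on synthetic weights with a single planted outlier. The data, the helper name `rtn_int4`, and the FP16 scale size are assumptions chosen for illustration.

```python
import numpy as np

def rtn_int4(w, group_size=None):
    """Round-to-nearest INT4 with one scale per group (per tensor if None)."""
    flat = w.reshape(-1, group_size) if group_size else w.reshape(1, -1)
    s = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    w_hat = np.clip(np.round(flat / s), -7, 7) * s
    return w_hat.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1024)
w[5] = 1.0                                   # one outlier stretches the range

mse_tensor = np.mean((w - rtn_int4(w)) ** 2)
mse_group = np.mean((w - rtn_int4(w, group_size=128)) ** 2)

# Storage cost of the extra scales, assuming one FP16 scale per group.
extra_bits = 16 / 128

print("per-tensor MSE:", mse_tensor)
print("group-wise MSE:", mse_group)
print("extra bits per weight:", extra_bits)
```

With a per-tensor scale the outlier sets the step size for every weight; with groups it only damages the one group it sits in.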

Why round-to-nearest is the wrong objective

Round-to-nearest (RTN) minimizes the error on each weight independently. But a layer does not output its weights; it outputs \(WX\) for some activation \(X\). Two weights with the same rounding error contribute very differently to the output depending on how large the activations they multiply are. RTN is blind to this. It spends its limited precision budget evenly when it should spend it where the activations are large.

This is the entire game. Every method that beats RTN does so by being activation-aware in some form, either by reconstructing the layer output directly or by protecting the weights that ride on big activations.

GPTQ: minimize the output error, weight by weight

GPTQ frames quantization as a per-layer reconstruction problem. For each linear layer it seeks quantized weights \(\hat{W}\) that minimize the squared error of the layer's output on a small calibration set:

\[\arg\min_{\hat{W}} \; \lVert WX - \hat{W}X \rVert_2^2\]

This is the Optimal Brain Quantization objective, descended from the classic Optimal Brain Surgeon pruning work. Solving it exactly is intractable for billions of weights, so GPTQ quantizes one column at a time and, crucially, after fixing each column's value it updates all the not-yet-quantized columns to compensate for the error just introduced. The compensation uses second-order information from the Hessian \(H = XX^\top\), which captures how the weights interact through the activations.

flowchart TD
  Start["Layer weights W<br/>+ calibration activations"]:::blue --> H["Compute Hessian<br/>H = X Xᵀ"]:::purple
  H --> Pick["Take next weight column"]:::purple
  Pick --> Round["Quantize this column"]:::purple
  Round --> Comp["Update remaining columns<br/>to absorb the error"]:::amber
  Comp --> More{"Columns left?"}:::slate
  More -->|"yes"| Pick
  More -->|"no"| Done["Quantized layer<br/>output error minimized"]:::emerald
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

The error-compensation step is what lets GPTQ tolerate a brutally coarse grid. A weight rounded badly is not a permanent loss; its error is pushed forward and partially cancelled by adjusting its neighbors. The result was the first method to quantize OPT-175B and BLOOM-176B to 3 and 4 bits in about four GPU hours with minimal perplexity loss (Frantar et al., 2022, arXiv:2210.17323). GPTQ never touches activations and needs only a few hundred calibration samples, which is why it became a workhorse.
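The loop in the diagram can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: it uses a plain matrix inverse instead of the Cholesky-based blocked update, a fixed per-row INT4 grid, and synthetic correlated activations. The name `gptq_like`, the damping value, and the data are all assumptions.

```python
import numpy as np

def snap(w, scale):
    """Snap values to the symmetric INT4 grid {-7..7} * scale."""
    return np.clip(np.round(w / scale), -7, 7) * scale

def gptq_like(W, X, damp=0.01):
    """Quantize W (out x in) column by column, compensating in later columns."""
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    scale = np.abs(W).max(axis=1) / 7.0           # one scale per output row
    H = X @ X.T                                   # Hessian proxy, in x in
    H += damp * np.mean(np.diag(H)) * np.eye(d)   # damping for a stable inverse
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(d):
        Q[:, i] = snap(W[:, i], scale)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        # Push this column's rounding error onto the not-yet-quantized columns.
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
        # Remove column i from the inverse Hessian of the remaining problem.
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
    return Q, scale

rng = np.random.default_rng(0)
d_in, d_out, n = 16, 32, 256
common = rng.normal(size=(1, n))
X = common + 0.3 * rng.normal(size=(d_in, n))     # correlated input channels
W = rng.normal(size=(d_out, d_in))

Q, scale = gptq_like(W, X)
rtn = snap(W, scale[:, None])
gptq_err = np.sum((W @ X - Q @ X) ** 2)
rtn_err = np.sum((W @ X - rtn @ X) ** 2)
print("RTN output error: ", rtn_err)
print("GPTQ-like output error:", gptq_err)
```

The compensation only pays off when input channels are correlated, which is why the synthetic activations share a common component; with uncorrelated inputs the Hessian is close to diagonal and the procedure reduces to something near plain rounding.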

AWQ: protect the weights that matter

AWQ starts from a sharp observation: not all weights are equally important, and you can tell which ones matter by looking at the activations they multiply, not the weights themselves. Roughly 1% of weight channels, those aligned with large-magnitude activation features, account for a disproportionate share of the model's accuracy. Keep those few in higher precision (or otherwise protect them) and 4-bit quantization barely hurts (Lin et al., 2023, AWQ, arXiv:2306.00978).

Rather than store a mixed-precision mess, AWQ achieves the protection with a per-channel scaling trick. It scales up the salient weight channels (and scales the corresponding activations down by the same factor, leaving the math unchanged) before quantizing. A larger weight occupies more of the quantization grid's range, so it is rounded more finely in relative terms. The genius is that this is a pure pre-processing transform: no mixed precision at inference, no custom kernels for outliers, just a better assignment of the grid.

flowchart LR
  A["Activations X"]:::blue --> S["Find salient channels<br/>by activation magnitude"]:::purple
  W2["Weights W"]:::blue --> S
  S --> Scale["Scale salient channels up<br/>(activations down to match)"]:::amber
  Scale --> Q["Quantize to 4-bit"]:::purple
  Q --> Out["Robust 4-bit weights"]:::emerald
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

Because AWQ keys off activation statistics rather than reconstructing one specific calibration set, it tends to generalize better when the deployment distribution differs from the calibration data, an important practical property when you do not know in advance what users will type.
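A rough sketch of the scaling idea follows, under stated assumptions: each input channel is scaled by its mean activation magnitude raised to an exponent `alpha`, the exponent is chosen by grid search on output error, and the inverse scale is folded back into the weights here for convenience (in deployment it would be absorbed on the activation side). The function names and synthetic data are illustrative, not the AWQ codebase.

```python
import numpy as np

def rtn_rows(W):
    """Round-to-nearest INT4 with one symmetric scale per output row."""
    s = np.abs(W).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(W / s), -7, 7) * s

def awq_like(W, X, n_grid=20):
    """Search alpha in [0, 1]; scale input channel j by mean|x_j| ** alpha."""
    act_mag = np.abs(X).mean(axis=1)          # one magnitude per input channel
    ref = W @ X
    best_err, best_alpha, best_W = np.inf, None, None
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha                  # alpha = 0 gives plain RTN
        W_hat = rtn_rows(W * s) / s           # scale up, quantize, undo the scale
        err = np.sum((ref - W_hat @ X) ** 2)
        if err < best_err:
            best_err, best_alpha, best_W = err, alpha, W_hat
    return best_W, best_alpha, best_err

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))
X = rng.normal(size=(64, 128))
X[3] *= 50.0                                  # one outlier activation channel

W_awq, alpha, awq_err = awq_like(W, X)
rtn_err = np.sum((W @ X - rtn_rows(W) @ X) ** 2)
print("chosen alpha:", alpha)
print("RTN output error:     ", rtn_err)
print("AWQ-like output error:", awq_err)
```

Because `alpha = 0` is part of the grid, the search can never do worse than round-to-nearest on the calibration data; how much better it does depends on how concentrated the activation outliers are.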

The activation problem and the rotation trick

GPTQ and AWQ both quantize weights and leave activations in higher precision. That captures most of the memory savings, because weights dominate the footprint, but it leaves the arithmetic in 16-bit. To run the matrix multiplies themselves in low precision (and to compress the KV cache, which can dominate memory for long contexts), you must quantize activations too, and that is where the outliers bite hardest.

SmoothQuant tackled this for 8-bit by migrating the difficulty: a mathematically equivalent rescaling shifts the dynamic range from the hard-to-quantize activations into the easy-to-quantize weights, enabling full W8A8 inference on models up to 530B (Xiao et al., 2022, SmoothQuant, arXiv:2211.10438).
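The rescaling can be sketched as follows, using the per-channel smoothing factor \(s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}\) from the SmoothQuant paper. The layout (weights as out x in, activations as in x tokens), the function name, and the synthetic data are assumptions made to match this article's \(WX\) convention.

```python
import numpy as np

def smooth(W, X, alpha=0.5):
    """Move dynamic range from activations into weights, per input channel."""
    mx = np.abs(X).max(axis=1)                # activation range per input channel
    mw = np.abs(W).max(axis=0)                # weight range per input channel
    s = mx ** alpha / mw ** (1.0 - alpha)
    return W * s, X / s[:, None], s

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(8, 64))
X[2] *= 50.0                                  # one outlier activation channel

W_s, X_s, s = smooth(W, X)

print("activation range before:", np.abs(X).max(axis=1).round(2))
print("activation range after: ", np.abs(X_s).max(axis=1).round(2))
print("output unchanged:", np.allclose(W @ X, W_s @ X_s))
```

With `alpha = 0.5` the per-channel ranges of the smoothed weights and activations come out equal, which is the sense in which the difficulty is shared between the two sides.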

For 4-bit activations, a more aggressive idea won: rotate the outliers away. If you multiply the activations by an orthogonal matrix (in practice a randomized Hadamard rotation) and fold the inverse into the weights, the model's output is unchanged, but the outlier energy that was concentrated in a few dimensions gets smeared evenly across all of them. A distribution with no spikes is easy to quantize. QuaRot showed this enables outlier-free 4-bit inference across weights, activations, and KV cache, with LLaMA2-70B losing at most 0.47 WikiText-2 perplexity and retaining 99% of zero-shot accuracy (Ashkboos et al., 2024, QuaRot, arXiv:2404.00456). SpinQuant then showed that learning the rotation, rather than using a fixed random one, closes the gap further, narrowing LLaMA-2 7B's 4-bit accuracy gap to 2.9 points on zero-shot reasoning (Liu et al., 2024, SpinQuant, arXiv:2405.16406).
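A small sketch of the rotation idea, assuming a Sylvester-construction Hadamard matrix with random sign flips and a synthetic activation vector with one planted outlier (the dimension, the outlier value, and the helper names are illustrative):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction, normalized to be orthogonal; n is a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def peak_to_rms(v):
    """How far the largest entry sticks out above the typical magnitude."""
    return np.abs(v).max() / np.sqrt(np.mean(v ** 2))

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=d)
x[7] = 80.0                                   # one outlier feature
W = rng.normal(size=(16, d))

signs = rng.choice([-1.0, 1.0], size=d)
Q = hadamard(d) * signs                       # H @ diag(signs), still orthogonal

x_rot = Q @ x                                 # rotate the activations
W_rot = W @ Q.T                               # fold the inverse into the weights

print("peak/RMS before:", peak_to_rms(x))
print("peak/RMS after: ", peak_to_rms(x_rot))
print("output unchanged:", np.allclose(W @ x, W_rot @ x_rot))
```

The rotation preserves the vector's norm, so the energy has not gone anywhere; it has only been shared across all coordinates, which is what makes a single coarse scale workable.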

[IMAGE: Side-by-side activation distributions, before and after a Hadamard rotation, showing spiky outliers on the left flattening into a smooth bell on the right]

Integers versus floats at 4 bits

There are two ways to spend four bits. INT4 places 16 levels at equal spacing. FP4 places them at floating-point spacing: more levels near zero, fewer in the tails. Since weight and activation distributions are roughly bell-shaped and centered on zero, the float layout matches them better, putting precision where the mass is. The practical formats add a twist called microscaling: a shared scale factor for each small block of values. MXFP4 uses blocks of 32 with an 8-bit power-of-two (E8M0) scale; NVFP4 uses blocks of 16 with a higher-fidelity FP8 (E4M3) scale, which tracks local range more tightly at the cost of more scale storage.
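The two grids can be written out and compared directly. The sketch below lists the FP4 (E2M1) magnitudes and a symmetric INT4 grid, then quantizes Gaussian blocks of 32 values with one scale per block. It uses a real-valued absmax scale for both grids, which is a simplification (MXFP4 restricts the scale to a power of two), and the data and helper names are assumptions.

```python
import numpy as np

# FP4 (E2M1) magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit.
fp4_pos = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
fp4_grid = np.concatenate([-fp4_pos[:0:-1], fp4_pos])   # +0 and -0 coincide
int4_grid = np.arange(-7, 8, dtype=np.float64)          # symmetric INT4

def quantize_block(x, grid):
    """Absmax-scale one block onto `grid`, snap to the nearest level, scale back."""
    s = np.abs(x).max() / grid.max()
    idx = np.abs(x[:, None] / s - grid[None, :]).argmin(axis=1)
    return grid[idx] * s

# Step sizes as a fraction of the representable range.
fp4_steps = np.diff(fp4_grid) / fp4_grid.max()
int4_steps = np.diff(int4_grid) / int4_grid.max()
print("FP4 step near zero / in tail:", fp4_steps.min(), fp4_steps.max())
print("INT4 step everywhere:        ", int4_steps.min())

rng = np.random.default_rng(0)
blocks = rng.normal(size=(256, 32))                     # 256 blocks of 32 values
mse_int4 = np.mean([(b - quantize_block(b, int4_grid)) ** 2 for b in blocks])
mse_fp4 = np.mean([(b - quantize_block(b, fp4_grid)) ** 2 for b in blocks])
print("INT4 block MSE:", mse_int4)
print("FP4 block MSE: ", mse_fp4)
```

Which grid yields the lower error depends on the block size and on how heavy the tails of the data are, so the printed MSEs are best read as a way to experiment rather than as a verdict.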

[IMAGE: Number line comparing INT4 (16 evenly spaced ticks) against FP4 (ticks clustered densely near zero, sparse in the tails), overlaid on a bell-shaped weight distribution]

Seeing It in Motion

The diagram below traces a single decode step through a 4-bit model, showing where precision changes along the path.

sequenceDiagram
  participant M as GPU memory
  participant K as Compute kernel
  participant A as Activations
  M->>K: Stream 4-bit weight block + scale
  K->>K: Dequantize block to FP16 (or compute in FP4)
  A->>K: FP16 activations (or quantized)
  K->>K: Matrix multiply
  K->>A: FP16 output for next layer
  Note over M,A: Repeat per layer, per token. Weight read dominates time.

On weight-only setups, the kernel reads 4-bit weights, expands them to 16-bit on the fly, and multiplies against 16-bit activations. The memory traffic is cut by roughly 4x even though the math is still 16-bit, and since the step was memory-bound, that traffic cut is the speedup. On native FP4 hardware, the dequantize step disappears and the multiply itself runs in 4-bit.
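The "read 4-bit, expand to 16-bit" step can be illustrated on the CPU with NumPy. The packing layout here (two signed values per byte, low nibble first, stored with an offset of 8) is an assumption for illustration; production kernels use their own layouts tuned to the hardware.

```python
import numpy as np

def pack_int4(q):
    """Pack signed INT4 values in [-8, 7] two per byte, low nibble first."""
    u = (q.astype(np.int16) + 8).astype(np.uint8)       # shift to 0..15
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_dequant(packed, scale):
    """Unpack nibbles, undo the offset, and expand to FP16 with one scale."""
    lo = (packed & 0x0F).astype(np.int16) - 8
    hi = (packed >> 4).astype(np.int16) - 8
    q = np.empty(packed.size * 2, dtype=np.int16)
    q[0::2], q[1::2] = lo, hi
    return (q * scale).astype(np.float16)

rng = np.random.default_rng(0)
q = rng.integers(-7, 8, size=256)                       # quantized weights
packed = pack_int4(q)
w_fp16 = unpack_dequant(packed, scale=1.0)

print("packed bytes:", packed.nbytes)
print("FP16 bytes:  ", w_fp16.nbytes)
```

The packed buffer is a quarter the size of the FP16 tensor it expands to, which is the traffic reduction the paragraph above describes.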

The decision of which precision to deploy is itself a small flowchart, driven by hardware and tolerance for risk.

stateDiagram-v2
  [*] --> Choose
  Choose --> W8A8: need lossless, have INT8/FP8 HW
  Choose --> W4A16: single-GPU fit, weights dominate
  Choose --> W4A4: long context or compute-bound, FP4 HW
  W8A8 --> Deploy: ~lossless
  W4A16 --> Deploy: near-lossless, GPTQ/AWQ
  W4A4 --> Validate: needs rotation + careful eval
  Validate --> Deploy: if gap acceptable
  Validate --> W4A16: if quality regresses
  Deploy --> [*]

By the Numbers

The headline figures, all from the primary sources, show how sharply the methods separate at low bit-widths. Perplexity is on WikiText-2 unless noted; lower is better, and a difference of a few tenths is meaningful at this scale.

| Setup | Method | Reported result | Source |
| --- | --- | --- | --- |
| OPT-175B, 4-bit weights | RTN | +4.53 perplexity vs FP16 | GPTQ, arXiv:2210.17323 |
| OPT-175B, 4-bit weights | GPTQ | +0.14 perplexity vs FP16 | GPTQ, arXiv:2210.17323 |
| OPT/BLOOM 175B, 3-4 bit | GPTQ | quantized in ~4 GPU hours | GPTQ, arXiv:2210.17323 |
| LLaMA2-70B, 4-bit W+A+KV | QuaRot | \(\le\) 0.47 perplexity loss, 99% zero-shot | QuaRot, arXiv:2404.00456 |
| LLaMA-2 7B, 4-bit W+A+KV | SpinQuant | 2.9 point zero-shot gap to FP16 | SpinQuant, arXiv:2405.16406 |

The memory and throughput arithmetic is more intuitive. For a dense decoder model, the weight footprint is simply parameters times bytes-per-weight, and decode latency tracks that footprint because the step is memory-bound.

| Precision | Bytes/weight | 70B weight footprint | Relative decode cost |
| --- | --- | --- | --- |
| FP16 | 2 | ~140 GB | 1.0x (baseline) |
| INT8 / FP8 | 1 | ~70 GB | ~0.5x |
| INT4 / FP4 | 0.5 | ~35 GB | ~0.25x (weight-only, approximate) |

The 4x reduction in raw weight bits is exact; the realized footprint and decode speedup are approximate because real kernels add dequantization overhead, scale storage (group-wise quantization adds roughly 0.1 to 0.5 bits per weight depending on group size), and the activation/KV-cache traffic that quantization may not touch. On hardware with native low-precision tensor cores, the compute side also accelerates: Blackwell's FP4 tensor cores deliver roughly twice the throughput of FP8 on the same chip, which is why FP4 is attractive for compute-bound prefill, not only memory-bound decode.
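The arithmetic behind the table is short enough to write down. In the sketch below, the 16-bit scale per group and the 2,000 GB/s memory bandwidth are assumptions picked for illustration, not measurements of any particular GPU, and the latency figure is only a lower bound for a memory-bound step.

```python
def weight_footprint_gb(n_params, bits, group_size=None, scale_bits=16):
    """Weight size in GB (1e9 bytes), with one scale per group if group_size is set."""
    eff_bits = bits + (scale_bits / group_size if group_size else 0.0)
    return n_params * eff_bits / 8 / 1e9

def decode_ms_per_token(footprint_gb, bandwidth_gb_s):
    """Lower bound for memory-bound decode: every weight is read once per token."""
    return footprint_gb / bandwidth_gb_s * 1e3

N = 70e9                                      # 70B parameters
BANDWIDTH = 2000.0                            # GB/s, a hypothetical figure

fp16 = weight_footprint_gb(N, 16)
int4_raw = weight_footprint_gb(N, 4)
int4_g128 = weight_footprint_gb(N, 4, group_size=128)

for name, gb in [("FP16", fp16), ("INT4 raw", int4_raw), ("INT4 g128", int4_g128)]:
    print(f"{name:10s} {gb:7.2f} GB  >= {decode_ms_per_token(gb, BANDWIDTH):5.1f} ms/token")
```

The group-size-128 row shows where the "roughly 0.1 bits per weight" end of the overhead range comes from: one 16-bit scale spread over 128 weights.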

[IMAGE: Bar chart of perplexity for RTN vs GPTQ vs AWQ across model sizes 1.3B to 175B, showing RTN diverging at scale while GPTQ/AWQ stay flat]

[IMAGE: Stacked memory bar comparing FP16, INT8, INT4 for a 70B model, broken into weights, KV cache, and activations, annotated with what each method does and does not compress]

A Concrete Example

Take a single weight vector from one row of a projection matrix and quantize it to 4-bit symmetric integers, the naive way and the activation-aware way, to see where RTN loses.

Suppose the row is \(w = [0.9,\ -0.1,\ 0.2,\ -0.05]\) and it multiplies an activation vector \(x = [0.1,\ 8.0,\ 0.1,\ 0.1]\). Notice that the second activation is an outlier, eighty times the others. The true dot product is:

\[w \cdot x = (0.9)(0.1) + (-0.1)(8.0) + (0.2)(0.1) + (-0.05)(0.1) = 0.09 - 0.80 + 0.02 - 0.005 = -0.695\]

Now quantize \(w\) to INT4. The scale is \(s = \max(|w|) / 7 = 0.9 / 7 \approx 0.1286\). Rounding each weight to the nearest multiple of \(s\):

| Weight | \(w/s\) | Rounded \(q\) | \(\hat{w} = sq\) | Error |
| --- | --- | --- | --- | --- |
| 0.90 | 7.00 | 7 | 0.900 | 0.000 |
| -0.10 | -0.78 | -1 | -0.129 | -0.029 |
| 0.20 | 1.56 | 2 | 0.257 | +0.057 |
| -0.05 | -0.39 | 0 | 0.000 | +0.050 |

The largest weight error, 0.057, is on the third element. But look at the output. The second weight, with a modest error of -0.029, multiplies the outlier activation of 8.0, so it injects \((-0.029)(8.0) \approx -0.23\) of output error all by itself. The third weight's larger error of 0.057 multiplies only 0.1, contributing a negligible 0.006, and the fourth adds another 0.005. The quantized dot product comes out near \(-0.91\), off by about 0.22, almost entirely from one weight that RTN treated as unimportant.

An activation-aware method sees this coming. AWQ would notice that the second channel rides a large activation and scale that weight up before quantizing so it lands on a finer part of the grid; GPTQ would compensate for the second weight's error by nudging the others. Either way, the precision budget shifts toward the weight that the activation makes expensive. That single reallocation, multiplied across billions of weights, is the entire difference between a 4-bit model that works and one that babbles.
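The round-to-nearest arithmetic in this example is easy to reproduce. The sketch below uses the same \(w\) and \(x\) and prints the contribution of each weight's rounding error to the output, so the mismatch between "largest weight error" and "largest output error" is visible directly. NumPy is the only assumption.

```python
import numpy as np

w = np.array([0.9, -0.1, 0.2, -0.05])
x = np.array([0.1, 8.0, 0.1, 0.1])

s = np.abs(w).max() / 7                       # symmetric INT4 scale
q = np.round(w / s)                           # integer codes
w_hat = s * q                                 # dequantized weights

true_out = w @ x
rtn_out = w_hat @ x
weight_err = w_hat - w                        # error on each weight
output_err = weight_err * x                   # what each error does to the output

print("codes:              ", q)
print("true / RTN output:  ", true_out, rtn_out)
print("per-weight error:   ", weight_err.round(4))
print("per-weight output error:", output_err.round(4))
print("largest weight error at index:", np.abs(weight_err).argmax())
print("largest output error at index:", np.abs(output_err).argmax())
```

The two `argmax` lines point at different elements, which is the whole argument for activation-aware quantization in miniature.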

Where It Breaks

Weight-only 4-bit quantization is close to a free lunch, but the lunch has fine print.

Activations are harder than weights, and the KV cache is the quietest casualty. Weights are static and can be analyzed offline; activations are dynamic, input-dependent, and full of outliers that move. Pushing activations to 4 bits without rotations typically costs real accuracy, and quantizing the KV cache too aggressively degrades long-context recall in ways a short benchmark will not catch.

Calibration data leaks into behavior. GPTQ and AWQ both fit to a small calibration set. If that set is unrepresentative (all English, all code, all short), the model can quantize "well" on paper and underperform on the real distribution. AWQ is more robust here than GPTQ, but neither is immune.

Perplexity hides damage. A model can hold its WikiText perplexity nearly flat while losing measurable ground on multi-step reasoning, instruction following, or rare-language generation. The literature increasingly evaluates on zero-shot task suites precisely because perplexity is too forgiving. Treat a single perplexity number as a smoke test, not a verdict.

Small models quantize worse than large ones. This is counterintuitive but consistent: large models are more redundant and have more room to absorb rounding error, while a 1 to 3B model has less slack. The reasoning-heavy small models popular for on-device use are exactly the ones where 4-bit can visibly hurt.

Outliers are a moving target. Each new architecture (different normalization, different activation functions, MoE routing) can relocate or reshape the outlier structure, so a recipe tuned for one model family is not guaranteed to transfer. LLaMA-3, for instance, proved harder to quantize than LLaMA-2, which is part of why learned rotations emerged.

Alternative Designs

The methods are less rivals than members of a toolkit, each strongest in a different regime.

| Approach | Strengths | Weaknesses | Best when |
| --- | --- | --- | --- |
| RTN (round-to-nearest) | Instant, no calibration | Collapses above ~6.7B at 4-bit | Quick 8-bit; never 4-bit at scale |
| LLM.int8() | Lossless 8-bit, handles outliers explicitly | Mixed-precision path is slower than pure INT8 | 8-bit serving where accuracy is non-negotiable |
| SmoothQuant | Full W8A8, fast INT8 math | 8-bit only; needs activation calibration | Throughput-bound 8-bit deployment |
| GPTQ | Excellent 4-bit weights, fast to apply | Weight-only; sensitive to calibration set | Single-GPU weight-only 4-bit |
| AWQ | Robust 4-bit, generalizes across distributions | Weight-only; needs activation statistics | 4-bit where deployment data is uncertain |
| QuaRot / SpinQuant | Enables 4-bit weights, activations, and KV | More complex; rotation overhead; eval-heavy | Aggressive W4A4, long-context compression |

Two camps run underneath this table. Post-training quantization (everything above) takes a finished model and compresses it with at most a little calibration. Quantization-aware training instead simulates quantization during training or fine-tuning so the model learns weights that round well, which can recover more accuracy at very low bit-widths but costs a training run. For most deployments PTQ wins on effort-to-quality; QAT earns its keep at 2 to 3 bits or for models that must run in the lowest possible precision on constrained hardware.

How It Is Used in Practice

Quantization is no longer exotic; it is the default path from a research checkpoint to something you can actually serve. The open-weight ecosystem ships GPTQ and AWQ checkpoints as a matter of course, and the major inference servers (vLLM, TensorRT-LLM, and others) load them directly, often pairing the quantized weights with optimized kernels such as Marlin for 4-bit matrix multiplies. The common production recipe is weight-only 4-bit (W4A16) for decode-heavy chat workloads, because it captures the memory win with the least accuracy risk and leaves activations in 16-bit, so there are no activation scales to calibrate.

The hardware has started to meet the software. NVIDIA's Blackwell generation added native FP4 tensor cores, and the surrounding stacks (TensorRT-LLM, vLLM) added NVFP4 paths to exploit them. This turns quantization from a pure memory optimization into a compute optimization, which matters most for the prefill phase of long prompts, where the workload is compute-bound. Teams treat the bit-width as a tuning knob: 8-bit when accuracy must be untouched, 4-bit weights as the everyday default, and 4-bit activations only after a careful evaluation gate, because the failure mode there is subtle quality loss rather than an obvious crash.

[IMAGE: A deployment pipeline diagram from FP16 checkpoint to served endpoint, with branches for W8A8, W4A16, and W4A4, annotated with the calibration and validation steps each requires]

Insights Worth Remembering

  • Quantization error is not about the weights; it is about the outputs. A weight's rounding error matters in proportion to the activation it multiplies, which is why every good method is activation-aware in some form.
  • The 6.7B cliff is a property of the model, not the math. Large Transformers concentrate information in a few outlier dimensions, and quantization that ignores them fails for the same reason the model would fail if you ablated those dimensions.
  • Memory bandwidth, not FLOPs, is what 4-bit buys back on decode. The speedup comes from moving fewer bytes, which is why weight-only quantization speeds up generation even when the arithmetic stays 16-bit.
  • Rotations turn a hard distribution into an easy one at almost no cost. Spreading outlier energy across all dimensions with an orthogonal transform leaves the unquantized model's output unchanged, and it is the key that unlocked 4-bit activations.
  • Float beats integer at four bits because the data is bell-shaped. FP4 puts levels where the values cluster; that is most of why FP4 hardware is worth building.
  • A flat perplexity number can hide a real regression in reasoning. Evaluate quantized models on the tasks you actually care about, not just language-modeling loss.
  • Bigger models are more forgiving. The same recipe that is near-lossless on 70B can visibly hurt a 3B model, so do not assume a small on-device model tolerates 4-bit as gracefully.

Open Questions

The frontier has clearly moved below 4 bits, but how far is unresolved. Methods exist that push to 3 and even 2 bits, yet the accuracy cost there is real and not yet reliably tamed for general use; whether 2-bit weight-only becomes a safe default or stays a specialist tool is an open empirical question.

Activation and KV-cache quantization at 4 bits and below is the live research front. Weight quantization is largely solved; activations are not, and long-context workloads make KV-cache compression increasingly urgent. Rotation methods help, but the interaction between aggressive KV quantization and long-range recall is not fully characterized, and it is plausible that the right answer is non-uniform precision across the context rather than a single bit-width.

The relationship between quantization and reasoning is poorly understood. There is mounting anecdotal and benchmark evidence that low-bit quantization disproportionately affects multi-step reasoning relative to single-shot tasks, but a clean mechanistic account of why (and a recipe that protects reasoning specifically) does not yet exist. This is inference rather than established fact: the evaluations are suggestive, not conclusive.

Finally, hardware and algorithms are now co-evolving. Native FP4 silicon was designed around the formats researchers found worked; the next round of formats (and the question of whether some future microscaling layout becomes the standard) will be shaped as much by what tensor cores can multiply cheaply as by what minimizes error. The likely near-term development is convergence on a small number of hardware-blessed 4-bit float formats, with INT4 persisting mainly where no such hardware is available.
