Prompt Compression

A RAG prompt that stuffs ten retrieved chunks plus a long system instruction in front of every question can run 8,000 to 12,000 tokens. You pay for those tokens on every call, they eat into a finite context window, and prefill latency scales with their count. Prompt compression asks a blunt question: how many of those tokens actually carry information the model needs, and can you drop the rest without the answer getting worse? The answer, surprisingly often, is that most of them are redundant. Natural language is highly redundant (low entropy per token), so the model can often reconstruct the meaning from a fraction of the surface tokens.
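To make the per-call cost concrete, here is a rough back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration: the price per million input tokens, the call volume, and the 4x compression ratio are hypothetical, not figures from any provider or benchmark.

```python
def monthly_prompt_cost(tokens_per_call: int,
                        calls_per_day: int,
                        usd_per_million_tokens: float,
                        days: int = 30) -> float:
    """Input-token spend over `days`, ignoring output tokens and caching."""
    total_tokens = tokens_per_call * calls_per_day * days
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical workload: 10,000-token prompts, 50,000 calls a day,
# at an assumed 1.00 USD per million input tokens.
baseline = monthly_prompt_cost(10_000, 50_000, 1.00)

# Same workload if the prompt were compressed 4x (an assumed ratio).
compressed = monthly_prompt_cost(2_500, 50_000, 1.00)

print(f"baseline:   ${baseline:,.0f} per month")
print(f"compressed: ${compressed:,.0f} per month")
```

Input cost is linear in prompt length, so under these assumptions any compression ratio translates directly into the same ratio of savings on input spend. Whether answer quality survives is a separate question.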

Two families attack this. Hard (extractive) compression deletes low-information tokens from the text itself, keeping a shorter string of real words. Soft compression trains a model to fold a long instruction into a handful of learned vectors that stand in for it. They differ in what the target model has to know, and in what they cost you.
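The hard (extractive) idea can be sketched in a few lines. This is a toy under stated assumptions, not a production method: it scores each token by its surprisal under a tiny unigram frequency table and keeps only the most surprising ones in their original order. Practical systems would typically get those scores from a small language model instead; the function name `compress_hard`, the `background` counts, and the `keep_ratio` parameter are all hypothetical. Soft compression is not shown, since it needs a trained model.

```python
import math
from collections import Counter


def compress_hard(text: str, keep_ratio: float, background: Counter) -> str:
    """Keep the highest-surprisal tokens of `text`, preserving their order.

    `background` is an assumed table of word counts standing in for a
    language model; frequent words get low surprisal and are dropped first.
    """
    tokens = text.split()
    if not tokens:
        return ""

    total = sum(background.values())
    vocab = len(background) + 1  # +1 bucket for unseen words

    def surprisal(tok: str) -> float:
        # Add-one smoothing so unseen words get a finite, high score.
        count = background.get(tok.lower(), 0)
        return -math.log((count + 1) / (total + vocab))

    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank token positions by surprisal; ties keep earlier positions first.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: surprisal(tokens[i]),
                    reverse=True)
    kept_positions = sorted(ranked[:n_keep])
    return " ".join(tokens[i] for i in kept_positions)


# Hypothetical background counts: function words are common, so cheap to drop.
background = Counter({"the": 50, "of": 30, "is": 20})
short = compress_hard("the capital of France is Paris", 0.5, background)
print(short)
```

The output is still a string of real words, which is the defining property of the hard family: it can be passed to any model as ordinary text.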

The economics that drive it
