Prompt Compression

Cutting prompt tokens while holding task performance, via perplexity-based token dropping (LLMLingua) or learned gist tokens, and when prompt caching beats both.

advanced · 8 min read · Premium

A RAG prompt that stuffs ten retrieved chunks plus a long system instruction in front of every question can run 8,000 to 12,000 tokens. You pay for those tokens on every call, they eat into a finite context window, and prefill latency scales with their count. Prompt compression asks a blunt question: how many of those tokens actually carry information the model needs, and can you drop the rest without the answer getting worse? The answer, surprisingly often, is that most of them are redundant. Natural language is low-entropy; the model can reconstruct meaning from a fraction of the surface tokens.
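To make the per-call cost concrete, here is a back-of-the-envelope sketch in Python. The price, the call volume, and the 4x compression ratio are hypothetical placeholders, not figures from any provider or benchmark; substitute your own.

```python
def prompt_cost_usd(prompt_tokens: int, calls: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of sending a prompt of this size `calls` times."""
    return prompt_tokens * calls * usd_per_million_tokens / 1_000_000


def compressed_tokens(prompt_tokens: int, ratio: float) -> int:
    """Token count after compressing by `ratio` (4.0 means 4x shorter)."""
    return round(prompt_tokens / ratio)


# All three constants below are assumptions chosen for illustration only.
PRICE_PER_MILLION = 3.0   # hypothetical USD per million input tokens
CALLS = 100_000           # hypothetical number of calls
RATIO = 4.0               # hypothetical compression ratio

baseline = prompt_cost_usd(10_000, CALLS, PRICE_PER_MILLION)
compressed = prompt_cost_usd(compressed_tokens(10_000, RATIO), CALLS, PRICE_PER_MILLION)

print(f"baseline:   ${baseline:,.2f}")
print(f"compressed: ${compressed:,.2f}")
print(f"saved:      ${baseline - compressed:,.2f}")
```

The sketch counts input tokens only. It ignores the cost of running the compressor itself and any change in answer quality, both of which have to be measured before the saving is real.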

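Two families attack this. Hard (extractive) compression deletes low-information tokens from the text itself, keeping a shorter string of real words. Soft compression trains a model to fold a long instruction into a handful of learned vectors that stand in for it. They differ in what the target model has to know: a hard-compressed prompt is still plain text, so any model can read it, including one behind an API, while learned vectors only mean something to a model trained to consume them. They also differ in what they cost you.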
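The hard-compression idea can be sketched in a few lines of Python. This is a toy, not LLMLingua: it scores each word by its surprisal under a unigram frequency table built from a reference text, whereas perplexity-based methods score tokens with a small language model conditioned on context. The function names and the `unseen_bits` default are hypothetical, chosen for illustration.

```python
import math
from collections import Counter


def unigram_surprisal(reference_text: str) -> dict[str, float]:
    """Surprisal in bits, -log2 p(word), for each word in a reference text."""
    counts = Counter(reference_text.lower().split())
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}


def compress(prompt: str, surprisal: dict[str, float],
             keep_ratio: float, unseen_bits: float = 20.0) -> str:
    """Keep the highest-surprisal words, in their original order.

    Words missing from the table get `unseen_bits` (an assumed constant),
    so rare content words survive and frequent function words are dropped.
    """
    words = prompt.split()
    n_keep = max(1, round(len(words) * keep_ratio))
    scored = [(surprisal.get(w.lower(), unseen_bits), i) for i, w in enumerate(words)]
    # Highest surprisal first; ties go to the earlier word.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n_keep]
    keep_indices = sorted(i for _, i in top)
    return " ".join(words[i] for i in keep_indices)


table = unigram_surprisal("the the the the of of of a a is is to to cache model")
short = compress("the latency of the cache is a bottleneck", table, keep_ratio=0.4)
print(short)
```

The output is still ordinary text, which is the defining property of the hard family. A unigram table cannot tell when a common word is load-bearing in context (a "not", for instance), which is one reason real systems score tokens with a contextual model instead.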

The economics that drive it
