The Hidden Cost of AI: Why Inference Is the New Cloud Bill

In March 2023, GPT-4 cost $30 per million input tokens. By April 2025, GPT-4.1 delivered comparable quality at $2 per million: a 15x price collapse in two years. Yet enterprise AI spending over the same period surged by roughly 320 percent. The math only works if you understand that cheaper tokens do not mean lower bills; they mean more tokens. This is the Jevons Paradox playing out in real time, and inference is the meter that never stops running.

The AI industry's public attention has fixated on training: the $100M cluster, the months-long run, the dramatic benchmark reveal. Training is loud. Inference is quiet, continuous, and increasingly where the money actually goes. For organisations that have moved past experimentation, inference already accounts for 60 to 90 percent of total AI compute spending (Epoch AI, 2025; Introl, 2025). The inference market alone reached an estimated $106 billion in 2025, projected to hit $255 billion by 2030 (TensorMesh, 2025).

Why this matters: Every organisation adopting AI is building a new recurring cost line, one that scales with usage, compounds with agentic workflows, and hides behind per-token pricing that obscures the real bill. Understanding inference economics is not an optimisation exercise; it is the difference between sustainable deployment and runaway spend.

TL;DR

Inference, not training, is the dominant cost for organisations deploying AI at scale. The split runs 60-90% inference in production environments.
Per-token prices have dropped roughly 50x per year at the median, yet total spending keeps climbing because cheaper tokens unlock more use cases.
Output tokens typically cost 2-5x more than input tokens at the major providers, with outliers from about 1.3x to 8x. Reasoning models amplify this further, generating 2-5x more output tokens per request.
Agentic workflows multiply token consumption by 10-100x over simple chat because every tool call resends the full conversation history.
Prompt caching, model routing, quantisation, and distillation can compound to 80-95% cost reductions when applied together.
A "K-shaped" pricing split is emerging: commodity inference races toward zero while reasoning-heavy workloads grow more expensive per task.
The environmental cost of inference (electricity, water, carbon) scales with usage and may soon exceed training's lifetime footprint.
The industry is shifting from per-token pricing toward outcome-based models, where you pay per resolved ticket rather than per generated token.

At a Glance

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '14px'}}}%%
flowchart LR
    subgraph Inputs["Cost Drivers"]
        T["Tokens consumed"]
        M["Model tier"]
        L["Latency target"]
        C["Context length"]
        A["Agent loops"]
    end
    subgraph Bill["The Inference Bill"]
        API["API spend"]
        HW["Hardware amortisation"]
        E["Energy + cooling"]
    end
    subgraph Levers["Optimisation Levers"]
        Cache["Prompt caching"]
        Route["Model routing"]
        Quant["Quantisation"]
        Dist["Distillation"]
        Batch["Batching"]
    end
    T --> Bill
    M --> Bill
    L --> Bill
    C --> Bill
    A --> Bill
    Bill --> Levers
    Levers -->|"80-95% reduction"| Bill

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

    class T,M,L,C,A blue
    class API,HW,E amber
    class Cache,Route,Quant,Dist,Batch emerald

Before the Meter Started

The cost of running a trained model was, for decades, an afterthought. A random forest scored a row in microseconds. A logistic regression cost nothing to serve. The entire concept of "inference economics" emerged only when models grew large enough that serving them required the same class of hardware used to train them.

%%{init: {'theme': 'base', 'themeVariables': {'cScale0': '#1e40af', 'cScale1': '#6d28d9', 'cScale2': '#b45309', 'cScale3': '#be123c', 'cScale4': '#047857', 'cScale5': '#0e7490', 'cScale6': '#6d28d9', 'cScaleLabel0': '#e2e8f0', 'cScaleLabel1': '#e2e8f0', 'cScaleLabel2': '#e2e8f0', 'cScaleLabel3': '#e2e8f0', 'cScaleLabel4': '#e2e8f0', 'cScaleLabel5': '#e2e8f0', 'cScaleLabel6': '#e2e8f0', 'textColor': '#e2e8f0', 'lineColor': '#94a3b8', 'fontSize': '14px'}}}%%
timeline
    title The Inference Cost Era
    2017 : Transformer architecture ships, inference costs negligible
    2020 : GPT-3 API launches at $60/M output tokens, first pay-per-token economy
    2022 : ChatGPT hits 100M users, inference volume becomes a business problem
    2023 : GPT-4 at $30/$60 per M tokens, enterprise sticker shock begins
    2024 : GPT-4o mini at $0.15/$0.60, Mixtral and Llama commoditise open inference
    2025 : GPT-4.1 at $2/$8, vLLM and PagedAttention mature, AI FinOps tools emerge
    2026 : Agentic workflows multiply token demand 10-100x, outcome-based pricing appears

The inflection came in 2022-2023 when ChatGPT turned inference from an engineering concern into a P&L line item. Suddenly, every API call had a dollar sign. The question shifted from "can the model do this?" to "can we afford for the model to do this, a million times a day?"

[IMAGE: Log-scale chart of GPT-class model pricing from March 2023 to June 2026, showing the 15x drop in frontier pricing and the 300x drop at the budget tier, with annotations for key model releases]

How the Bill Actually Works

The token is the unit of currency

Every major API provider charges per token, but the pricing asymmetry between input and output tokens is the first thing most teams underestimate. Output tokens typically cost 2-5x more than input tokens, and the ratios in the table below range from 1.3x to 8.3x. Claude Opus 4.7 charges $5 per million input tokens but $25 per million output (Anthropic pricing). GPT-4.1 charges $2 input, $8 output (OpenAI pricing). This asymmetry exists because generation is autoregressive: each output token requires a full forward pass through the model, while input tokens can be processed in parallel.

The practical consequence: a system that generates long responses (code generation, report writing, chain-of-thought reasoning) pays a fundamentally different rate than one that classifies or extracts. Two applications using the same model can differ by 10x in unit economics depending on their output-to-input ratio.

Provider	Model	Input ($/M)	Output ($/M)	Output/Input Ratio
Anthropic	Opus 4.7	5.00	25.00	5.0x
Anthropic	Sonnet 4.6	3.00	15.00	5.0x
Anthropic	Haiku 4.5	1.00	5.00	5.0x
OpenAI	GPT-4.1	2.00	8.00	4.0x
OpenAI	GPT-4.1 nano	0.10	0.40	4.0x
Google	Gemini 2.5 Pro	1.25	10.00	8.0x
Google	Gemini 2.5 Flash	0.30	2.50	8.3x
DeepSeek	V3	0.14	0.28	2.0x
Groq	Llama 3.3 70B	0.59	0.79	1.3x
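
To make the asymmetry concrete, here is a minimal sketch of per-request cost using the Sonnet 4.6 prices from the table above. The two workloads and their token counts are hypothetical, chosen only to contrast a read-heavy task with a write-heavy one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in US dollars per million tokens."""
    input_per_m: float
    output_per_m: float


def request_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the given per-million-token prices."""
    return (input_tokens * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


# Sonnet 4.6 prices from the table above.
sonnet = Price(input_per_m=3.00, output_per_m=15.00)

# Hypothetical workloads: a classifier reads a prompt and emits a short label;
# a report writer reads the same prompt and generates a long answer.
classifier = request_cost(sonnet, input_tokens=2_000, output_tokens=10)
report_writer = request_cost(sonnet, input_tokens=2_000, output_tokens=4_000)

print(f"classifier:    ${classifier:.5f} per request")
print(f"report writer: ${report_writer:.5f} per request")
print(f"ratio:         {report_writer / classifier:.1f}x")
```

Same model, same prompt length, and the per-request cost differs by a little over 10x purely because of how much text comes back.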

[IMAGE: Stacked bar chart comparing input vs output token costs across providers, making the asymmetry visually obvious]

The context window tax

Larger context windows are presented as a feature. They are also a cost multiplier. A 128K-token request costs 16x more than an 8K request at identical per-token rates. Some providers add surcharges: Gemini 2.5 Pro doubles its input price from $1.25 to $2.50 per million tokens above 200K context (Google pricing). Others, like Anthropic, charge a flat rate across the full million-token window but the total spend still scales linearly with tokens consumed.
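
The context tax is easy to sketch. The function below is an illustration, not a provider's billing logic: it assumes that once a prompt crosses the surcharge threshold, the higher rate applies to the whole prompt rather than only to the tokens above it. Check the provider's pricing page before relying on that.

```python
from typing import Optional


def input_cost(tokens: int, base_per_m: float,
               surcharge_per_m: Optional[float] = None,
               surcharge_threshold: Optional[int] = None) -> float:
    """Input-side cost in dollars for a single request.

    Assumption: a prompt longer than the threshold is billed entirely
    at the surcharge rate.
    """
    rate = base_per_m
    if surcharge_per_m is not None and surcharge_threshold is not None:
        if tokens > surcharge_threshold:
            rate = surcharge_per_m
    return tokens * rate / 1_000_000


# Same per-token rate, different context sizes.
small = input_cost(8_000, base_per_m=1.25)
large = input_cost(128_000, base_per_m=1.25)
print(f"8K request:   ${small:.4f}")
print(f"128K request: ${large:.4f} ({large / small:.0f}x)")

# Gemini 2.5 Pro figures quoted above: $1.25 up to 200K, $2.50 beyond.
long_prompt = input_cost(250_000, base_per_m=1.25,
                         surcharge_per_m=2.50, surcharge_threshold=200_000)
print(f"250K request with surcharge: ${long_prompt:.4f}")
```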

The energy cost scales even more steeply. A 10K input token query consumes approximately 2.5 watt-hours. A 100K input query consumes roughly 40 Wh, a 16x energy increase for 10x more context (Epoch AI, 2025). And context length has a quality dimension too: models advertising 128K context may deliver frontier-quality output at 4K tokens but degraded quality at 128K. You are not buying the same model at every context length.

Reasoning models: thinking costs money

Reasoning models (OpenAI's o-series, Claude with extended thinking) generate internal chain-of-thought tokens before producing their answer. These thinking tokens are billed at output rates. A complex code review using extended thinking can consume 2-5x more output tokens than the same review with a standard model. A single prompt at maximum effort can burn 50K output tokens before producing a one-paragraph answer.
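
As a rough illustration of what that means on the bill, the sketch below prices a short visible answer with and without 50K thinking tokens at the Opus 4.7 output rate quoted earlier. The 200-token visible answer is a hypothetical stand-in for "one paragraph".

```python
def output_cost(visible_tokens: int, thinking_tokens: int,
                output_per_m: float) -> float:
    """Thinking tokens are billed at the output rate, like visible tokens."""
    return (visible_tokens + thinking_tokens) * output_per_m / 1_000_000


OPUS_OUTPUT_PER_M = 25.00  # $/M output tokens, from the pricing quoted earlier

standard = output_cost(visible_tokens=200, thinking_tokens=0,
                       output_per_m=OPUS_OUTPUT_PER_M)
max_effort = output_cost(visible_tokens=200, thinking_tokens=50_000,
                         output_per_m=OPUS_OUTPUT_PER_M)

print(f"standard answer:   ${standard:.3f}")
print(f"with 50K thinking: ${max_effort:.3f}")
```

The reader sees the same paragraph either way; nearly all of the second figure is spent on tokens that never appear in the response.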

The counterintuitive finding: 41 percent of reasoning tokens can be eliminated on average without any accuracy loss. Chain-of-Draft, a technique that uses compact intermediate reasoning, matches standard chain-of-thought accuracy while consuming only 7.6 to 32 percent of the tokens (TianPan.co, 2026). Reasoning is worth its cost for multi-step arithmetic, logical deduction, and planning. For classification, extraction, and routing, it actively hurts: one study found accuracy dropped 17.2 percent when chain-of-thought was applied to a classification task.

[IMAGE: Side-by-side token breakdown of the same prompt answered by a standard model vs a reasoning model, showing the 2-5x token overhead from internal chain-of-thought]

Agent economics: the multiplier nobody budgeted for

The sharpest cost escalation in 2025-2026 comes from agentic workflows. A simple chat completion is one API call. An agent that browses the web, writes code, runs tests, and iterates on failures can issue 50 to 200 API calls per task. Each call resends the full conversation history because LLM APIs are stateless. The result: agentic coding workflows on benchmarks like SWE-bench average 1 to 3.5 million tokens per task (LeanOps, 2026).

Research from Stanford's Digital Economy Lab found that agentic tasks can consume 1,000x more tokens than simple reasoning tasks, with input tokens (not output) driving the cost because of repeated context transmission (Bai et al., 2026, arXiv:2604.22750). Token consumption for the same task can vary by 30x between models; on identical benchmarks, some models consumed over 1.5 million more tokens than others.
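
The mechanism behind this is simple to simulate. The sketch below assumes a stateless API where every call resends the entire history; the 2,000-token base prompt and 1,500 tokens added per step are hypothetical round numbers, not measurements.

```python
def agent_loop_tokens(steps: int, base_prompt: int = 2_000,
                      added_per_step: int = 1_500) -> int:
    """Total input tokens sent across an agent loop on a stateless API.

    Every call resends everything so far: the base prompt plus the model
    replies and tool outputs accumulated in all earlier steps.
    """
    total = 0
    history = base_prompt
    for _ in range(steps):
        total += history            # the whole history goes over the wire again
        history += added_per_step   # this step's reply and tool output
    return total


for steps in (1, 5, 20, 50, 200):
    print(f"{steps:>3} steps: {agent_loop_tokens(steps):>12,} input tokens")
```

Total input grows with the square of the step count, not linearly: going from 50 to 200 steps multiplies the tokens sent by roughly 16, which is why long agent runs dominate the bill.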

%%{init: {'theme': 'base', 'themeVariables': {'actorBkg': '#1e40af', 'actorTextColor': '#fff', 'actorBorder': '#3b82f6', 'signalColor': '#94a3b8', 'signalTextColor': '#e2e8f0', 'noteBkgColor': '#1e293b', 'noteTextColor': '#e2e8f0', 'noteBorderColor': '#475569', 'activationBorderColor': '#3b82f6', 'activationBkgColor': '#1e3a5f', 'fontSize': '14px'}}}%%
sequenceDiagram
    participant User
    participant Agent
    participant LLM as LLM API
    participant Tool as External Tool
    
    User->>Agent: "Fix the failing test in auth.py"
    Agent->>LLM: System prompt + task (2K tokens)
    LLM-->>Agent: Plan + first tool call
    Agent->>Tool: Read file auth.py
    Tool-->>Agent: File contents (800 tokens)
    Agent->>LLM: Full history + file contents (3.5K tokens)
    LLM-->>Agent: Edit suggestion + run tests
    Agent->>Tool: Apply edit + run pytest
    Tool-->>Agent: Test output (1.2K tokens)
    Agent->>LLM: Full history + test output (5.5K tokens)
    Note over Agent,LLM: ... 15 more iterations ...
    Agent->>LLM: Full history (45K tokens)
    LLM-->>Agent: Final confirmation
    Note right of LLM: Total: ~350K tokens<br/>Input cost at Sonnet 4.6: ~$1.05<br/>A 5-step loop: 3.2x base<br/>A 50-step loop: 30x+ base

[IMAGE: Heat map showing token consumption per step in a 20-step agent loop, illustrating how the cumulative context balloons with each iteration]

Seeing It in Motion

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '14px'}}}%%
flowchart TD
    subgraph Optimisation["The Inference Optimisation Stack"]
        direction TB
        L1["Layer 1: Model Selection"]
        L2["Layer 2: Serving Infrastructure"]
        L3["Layer 3: Request Optimisation"]
        L4["Layer 4: Application Design"]
    end

    L1 --> R1["Route to smallest capable model<br/>FrugalGPT: 98% cost reduction vs GPT-4"]
    L1 --> D1["Distill: 770M model beats 540B PaLM<br/>on ANLI - Google Research, 2023"]
    
    L2 --> V1["vLLM + PagedAttention<br/>2-4x throughput, under 4% memory waste"]
    L2 --> Q1["Quantisation: AWQ/GGUF<br/>92-95% quality at 4-bit"]
    L2 --> S1["Speculative decoding<br/>2-3x latency reduction, zero quality loss"]
    
    L3 --> C1["Prompt caching<br/>90% input cost reduction on cache hit"]
    L3 --> B1["Continuous batching<br/>Up to 36.9x throughput"]
    
    L4 --> A1["Compress agent context<br/>Summarise history instead of resending"]
    L4 --> T1["Tiered reasoning<br/>Chain-of-Draft: 7-32% of CoT tokens"]

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

    class L1,R1,D1 blue
    class L2,V1,Q1,S1 purple
    class L3,C1,B1 teal
    class L4,A1,T1 emerald

The optimisation stack is layered, and the layers compound. A team that routes requests to the smallest capable model (Layer 1), serves that model with quantised weights on vLLM (Layer 2), caches repeated system prompts (Layer 3), and compresses agent context between steps (Layer 4) can achieve 80-95% cost reduction compared to naive deployment.
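
Savings from stacked layers multiply rather than add, because each layer only acts on the cost the previous one left behind. A minimal sketch, with per-layer savings that are hypothetical placeholders rather than figures from any of the cited studies:

```python
from math import prod
from typing import Iterable


def combined_reduction(reductions: Iterable[float]) -> float:
    """Overall fractional saving when each layer removes a share of
    whatever cost remains after the layers before it."""
    remaining = prod(1.0 - r for r in reductions)
    return 1.0 - remaining


# Hypothetical per-layer savings: routing, quantised serving,
# prompt caching, context compression.
layers = [0.50, 0.40, 0.30, 0.30]
print(f"combined reduction: {combined_reduction(layers):.1%}")
```

Four moderate savings land at about 85 percent overall, inside the 80-95 percent range, without any single layer doing heroic work.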

%%{init: {'theme': 'base', 'themeVariables': {'actorBkg': '#1e40af', 'actorTextColor': '#fff', 'actorBorder': '#3b82f6', 'signalColor': '#94a3b8', 'signalTextColor': '#e2e8f0', 'labelBoxBkgColor': '#1e293b', 'labelBoxBorderColor': '#334155', 'labelTextColor': '#e2e8f0', 'loopTextColor': '#e2e8f0', 'noteBkgColor': '#1e293b', 'noteTextColor': '#e2e8f0', 'noteBorderColor': '#475569', 'activationBorderColor': '#3b82f6', 'activationBkgColor': '#1e3a5f', 'fontSize': '14px'}}}%%
sequenceDiagram
    participant Req as Incoming Request
    participant Router as Model Router
    participant Cache as Prompt Cache
    participant Small as Haiku / 8B Model
    participant Large as Opus / 70B Model

    Req->>Router: Classify complexity
    alt Simple (60-80% of traffic)
        Router->>Cache: Check cache
        alt Cache hit (84% after tuning)
            Cache-->>Req: Cached response (90% cheaper)
        else Cache miss
            Cache->>Small: Forward to small model
            Small-->>Req: Response at $1/$5 per M tokens
        end
    else Complex (20-40% of traffic)
        Router->>Cache: Check cache
        Cache->>Large: Forward to large model
        Large-->>Req: Response at $5/$25 per M tokens
    end
    Note right of Router: RouteLLM: 85% cost reduction<br/>on MT Bench while maintaining<br/>95% of GPT-4 quality
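
The routing step in the diagram can be sketched in a few lines. RouteLLM and FrugalGPT use learned routers; the version below swaps in a crude keyword-and-length heuristic purely for illustration. The model identifiers and hint words are hypothetical, and the prices are the Haiku 4.5 and Opus 4.7 input rates quoted earlier.

```python
SMALL_MODEL = "small-model"  # hypothetical identifier, not a real API model name
LARGE_MODEL = "large-model"  # hypothetical identifier

# Hypothetical signals that a request needs the larger model.
COMPLEX_HINTS = ("refactor", "prove", "debug", "multi-step", "plan")


def route(prompt: str, max_simple_chars: int = 400) -> str:
    """Send long or reasoning-flavoured prompts to the large model."""
    text = prompt.lower()
    if len(text) > max_simple_chars or any(hint in text for hint in COMPLEX_HINTS):
        return LARGE_MODEL
    return SMALL_MODEL


def blended_price(simple_share: float, small_per_m: float,
                  large_per_m: float) -> float:
    """Average $/M tokens when a share of traffic goes to the small model."""
    return simple_share * small_per_m + (1.0 - simple_share) * large_per_m


print(route("What is my order status?"))
print(route("Please debug this failing integration test"))

# 70% of traffic routed to a $1/M model, the rest to a $5/M model.
price = blended_price(0.70, small_per_m=1.00, large_per_m=5.00)
print(f"blended input price: ${price:.2f}/M vs $5.00/M unrouted")
```

A production router would be calibrated against quality metrics, since every misrouted complex query trades cost for a worse answer.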

By the Numbers

The price collapse

The median inference price for a given level of benchmark performance has been declining at roughly 50x per year. When isolating data from January 2024 onward, the median accelerates to 200x per year, with the fastest trend reaching 900x per year (Epoch AI, LLM Inference Price Trends). Algorithmic efficiency alone contributes approximately 3x per year, independent of hardware improvements (Gundlach et al., 2025, The Price of Progress, arXiv:2511.23455).

Model	Date	Input $/M	Output $/M	Cheaper than GPT-4 (Mar 2023), by output price
GPT-4	Mar 2023	30.00	60.00	baseline
GPT-4 Turbo	Nov 2023	10.00	30.00	2x cheaper
GPT-4o	May 2024	5.00	15.00	4x cheaper
GPT-4o mini	Jul 2024	0.15	0.60	100x cheaper
GPT-4.1	Apr 2025	2.00	8.00	7.5x cheaper
GPT-4.1 nano	Apr 2025	0.10	0.40	150x cheaper
DeepSeek V3	Dec 2024	0.14	0.28	214x cheaper
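
The comparison column can be recomputed from the listed prices. The sketch below shows that the multiples in the table follow the output price; measured on input price, most of the drops are larger.

```python
BASELINE_INPUT, BASELINE_OUTPUT = 30.00, 60.00  # GPT-4, March 2023, $/M tokens

# (input $/M, output $/M) as listed in the table above.
PRICES = {
    "GPT-4 Turbo": (10.00, 30.00),
    "GPT-4o": (5.00, 15.00),
    "GPT-4o mini": (0.15, 0.60),
    "GPT-4.1": (2.00, 8.00),
    "GPT-4.1 nano": (0.10, 0.40),
    "DeepSeek V3": (0.14, 0.28),
}


def cheaper_by(model: str) -> tuple[float, float]:
    """Return (input multiple, output multiple) relative to GPT-4 at launch."""
    input_price, output_price = PRICES[model]
    return BASELINE_INPUT / input_price, BASELINE_OUTPUT / output_price


for name in PRICES:
    by_input, by_output = cheaper_by(name)
    print(f"{name:<13} input {by_input:6.1f}x   output {by_output:6.1f}x")
```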

The environmental ledger

A standard ChatGPT query (roughly 500 output tokens on GPT-4o) consumes approximately 0.3 to 0.42 watt-hours, at most about 40 percent more than a Google Search query (Epoch AI, 2025). That sounds modest until you scale it: ChatGPT's daily inference load, at one billion messages and 0.3 Wh each, comes to 300 MWh per day, a continuous draw of approximately 12.5 megawatts. Global data centre electricity consumption hit 415 TWh in 2024 and is projected to reach 945 TWh by 2030, roughly 3 percent of global electricity (IEA, 2025). AI-focused data centre electricity surged 50 percent in 2025 alone.
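
The scaling arithmetic is a unit conversion from watt-hours per message to average megawatts. A small sketch, using the per-query range and message volume quoted above:

```python
def continuous_power_mw(messages_per_day: float, wh_per_message: float) -> float:
    """Average power draw in megawatts for a given daily message volume."""
    daily_wh = messages_per_day * wh_per_message
    average_watts = daily_wh / 24      # Wh per day spread over 24 hours
    return average_watts / 1_000_000   # W -> MW


low = continuous_power_mw(1e9, 0.30)
high = continuous_power_mw(1e9, 0.42)
print(f"at 0.30 Wh/message: {low:.1f} MW")
print(f"at 0.42 Wh/message: {high:.1f} MW")
```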

Water consumption is the less-discussed cost. Processing a 100-word query on a large model uses roughly 519 millilitres of water when accounting for both direct cooling and indirect water use in electricity generation (Li et al., 2025, Cell Patterns). Google withdrew 6.4 billion gallons of water in 2023, with 95 percent going to data centre cooling.

[IMAGE: Sankey diagram showing energy flow from power grid through data centre cooling, compute, and networking, with AI inference highlighted as a growing share]

[IMAGE: World map showing carbon intensity of electricity grids by region, overlaid with major AI data centre locations, illustrating why the same inference query has different environmental costs depending on where it runs]

A Concrete Example

Consider a mid-sized SaaS company deploying an AI customer support agent. The agent handles 50,000 conversations per day, each averaging 8 turns. The system uses Claude Sonnet 4.6 ($3/M input, $15/M output) with a 4K-token system prompt, retrieves 2K tokens of context per turn from a knowledge base, and generates roughly 300 output tokens per response.

Naive deployment (no optimisation):

Per turn: 4,000 (system prompt) + 2,000 (RAG context) + average 3,500 (conversation history, growing each turn) = 9,500 input tokens. Output: 300 tokens.

Across 8 turns, conversation history accumulates. Total input tokens per conversation: approximately 76,000. Total output tokens: 2,400.

Daily input: 76,000 x 50,000 = 3.8B tokens. Daily output: 2,400 x 50,000 = 120M tokens.

Daily cost: (3,800 x $3) + (120 x $15) = $11,400 + $1,800 = $13,200/day, or roughly $396,000/month.
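
The naive bill can be reproduced directly from the per-conversation totals above. The 30-day month is an assumption made to match the monthly figure.

```python
CONVERSATIONS_PER_DAY = 50_000
INPUT_TOKENS_PER_CONVERSATION = 76_000
OUTPUT_TOKENS_PER_CONVERSATION = 2_400  # 8 turns x 300 tokens

INPUT_PRICE_PER_M = 3.00    # Sonnet 4.6, $/M input tokens
OUTPUT_PRICE_PER_M = 15.00  # Sonnet 4.6, $/M output tokens


def daily_cost(conversations: int, input_tokens: int, output_tokens: int) -> float:
    """Daily API spend in dollars, with no caching, routing, or compression."""
    input_millions = conversations * input_tokens / 1_000_000
    output_millions = conversations * output_tokens / 1_000_000
    return (input_millions * INPUT_PRICE_PER_M
            + output_millions * OUTPUT_PRICE_PER_M)


naive_daily = daily_cost(CONVERSATIONS_PER_DAY,
                         INPUT_TOKENS_PER_CONVERSATION,
                         OUTPUT_TOKENS_PER_CONVERSATION)
print(f"daily:   ${naive_daily:,.0f}")
print(f"monthly: ${naive_daily * 30:,.0f}")  # assumes a 30-day month
```

Input tokens account for $11,400 of the $13,200, which is why the optimisations that follow mostly target the prompt side.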

After optimisation:

Prompt caching (system prompt cached, 90% discount on cache hits): The 4K system prompt, repeated 400,000 times daily, drops from $4,800/day to $480/day. Saving: $4,320/day.
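Model routing (route the 70% of queries that are simple to Haiku 4.5 at $1/$5): Simple queries (order status, password resets) cost about 67% less per token than on Sonnet. Complex queries stay on Sonnet. Blended saving: roughly 47% across all traffic, assuming simple and complex queries use similar token counts.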
Context compression (summarise conversation history after turn 4 instead of resending verbatim): Reduces accumulated history tokens by approximately 60%. Saving: roughly $2,400/day on input costs.
Output trimming (structured responses instead of verbose paragraphs): Output drops from 300 to 180 tokens average. Saving: 40% on output costs.

Combined optimised cost: approximately $3,800/day, or $114,000/month. A 71% reduction.
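
Two of the line items above can be checked in isolation. The sketch below reproduces the caching and output-trimming savings only; it does not attempt to rebuild the combined figure, which also depends on how routing and compression interact.

```python
TURNS_PER_DAY = 50_000 * 8   # conversations x turns
INPUT_PRICE_PER_M = 3.00     # Sonnet 4.6, $/M input tokens
OUTPUT_PRICE_PER_M = 15.00   # Sonnet 4.6, $/M output tokens


def daily_dollars(tokens_per_turn: int, price_per_m: float) -> float:
    """Daily spend for a component that appears once in every turn."""
    return tokens_per_turn * TURNS_PER_DAY / 1_000_000 * price_per_m


# Prompt caching: the 4K system prompt at a 90% discount on cache hits.
# Simplification: every request is treated as a cache hit.
system_prompt_full = daily_dollars(4_000, INPUT_PRICE_PER_M)
system_prompt_cached = system_prompt_full * 0.10
caching_saving = system_prompt_full - system_prompt_cached

# Output trimming: 300 tokens per response down to 180.
output_before = daily_dollars(300, OUTPUT_PRICE_PER_M)
output_after = daily_dollars(180, OUTPUT_PRICE_PER_M)

print(f"system prompt: ${system_prompt_full:,.0f} -> ${system_prompt_cached:,.0f}")
print(f"caching saves: ${caching_saving:,.0f}/day")
print(f"output:        ${output_before:,.0f} -> ${output_after:,.0f}")
```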

The gap between $396K and $114K per month is the difference between a viable product and one that cannot scale. And this example uses a single model for a single use case. Organisations running dozens of AI features across millions of users face these decisions at every layer.

[IMAGE: Waterfall chart showing the naive monthly cost of $396K being reduced step by step through caching, routing, compression, and output trimming to $114K]

Where It Breaks

The Jevons Paradox is real and accelerating. Per-token costs dropped approximately 1,000x in three years. Enterprise AI spending surged. Hyperscalers have committed $602 billion in capital expenditure for 2026, roughly 75 percent tied to AI infrastructure (Goldman Sachs, 2025). Cheaper inference removes financial gatekeeping; the median enterprise by end of 2025 was running AI inference across dozens of distinct use cases. The total bill grows even as the unit price shrinks.

Agent cost runaway has no natural ceiling. A 5-step agent loop costs 3.2x more than a single call. A 50-step loop exceeds 30x. A 200-step autonomous debugging session exceeds 100x. Without explicit token budgets and circuit breakers, a single runaway agent can consume more tokens than an entire day of normal traffic. Runs on the same task can differ by 30x in total tokens depending on the model.
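
A token budget with a circuit breaker does not need much machinery. The sketch below is one possible shape, not a feature of any particular agent framework: the class names, limits, and the simulated loop are all hypothetical.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when an agent run exceeds its token or step allowance."""


class TokenBudget:
    """Hard ceiling on the tokens and steps one agent run may consume."""

    def __init__(self, max_tokens: int, max_steps: int) -> None:
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.used_tokens = 0
        self.steps = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one model call and trip the breaker if a limit is passed."""
        self.steps += 1
        self.used_tokens += input_tokens + output_tokens
        if self.steps > self.max_steps:
            raise TokenBudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.used_tokens > self.max_tokens:
            raise TokenBudgetExceeded(
                f"token limit {self.max_tokens:,} exceeded at step {self.steps}"
            )


def simulate_runaway(budget: TokenBudget, base_prompt: int = 2_000,
                     added_per_step: int = 1_500) -> str:
    """Stand-in for an agent that never decides it is finished."""
    history = base_prompt
    try:
        while True:
            budget.charge(input_tokens=history, output_tokens=300)
            history += added_per_step  # each call resends a longer history
    except TokenBudgetExceeded as stop:
        return str(stop)


budget = TokenBudget(max_tokens=100_000, max_steps=50)
print(simulate_runaway(budget))
print(f"stopped after {budget.steps} calls and {budget.used_tokens:,} tokens")
```

Because usage is recorded after each call, the run can overshoot the limit by one call; a stricter variant would estimate the next call's size and refuse it up front.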

Context window inflation creates hidden bills. Teams adopt 128K or 1M context windows because they can, stuffing entire codebases or document sets into prompts. The per-request cost inflates silently. Worse, quality degrades at extreme context lengths, so the team pays more for worse results.

FinOps maturity lags far behind adoption. Only 44 percent of organisations have adopted financial guardrails or AI FinOps practices, per a Gartner survey of 353 D&A/AI leaders in March 2026 (Waxell, 2026). Most AI cost tools provide visibility but not enforcement. The gap between "we can see the spend" and "we can control it" is where budgets blow up.

Alternative Designs

The core design decision for any team at scale is where to run inference and which models to use.

Strategy	Strengths	Weaknesses	Best when
Frontier API (Opus, GPT-4.1)	Highest quality, zero ops burden, instant scaling	Expensive at volume, vendor lock-in, unpredictable pricing changes	Low-to-medium volume, quality-critical tasks
Budget API (Haiku, Flash, DeepSeek V3)	90% quality at 10-20% cost, same zero-ops benefit	Quality ceiling on complex reasoning, still vendor-dependent	High-volume commodity tasks (classification, extraction)
Self-hosted open models (Llama, Mistral via vLLM)	Predictable costs at scale, full control, no data leaves your network	Engineering overhead ($270-550K/yr in staffing), GPU procurement risk	100M+ tokens/month against frontier pricing, strict data residency
Model routing (FrugalGPT, RouteLLM)	Best of both: quality where needed, savings everywhere else	Router itself adds latency and complexity, needs calibration	Mixed workloads with varying complexity
Distilled task-specific models	Extreme cost reduction (5-94%), often faster	Narrow capability, retraining cost on distribution shift	High-volume, well-defined tasks (Checkr: 5x savings with fine-tuned Llama-3-8B)

The break-even between self-hosted and API depends heavily on model size and utilisation. Against frontier APIs, self-hosting breaks even at roughly 100-256 million tokens per month. Against budget APIs like DeepSeek V3 at $0.14/M tokens, self-hosting rarely justifies itself at any volume. The hidden multiplier: raw GPU costs represent only 30-40 percent of true infrastructure investment, so plan for roughly 2.5-3x the GPU line item once networking, storage, cooling, and power are included (SitePoint, 2026).
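
The break-even logic itself is a one-line division, and writing it down shows why the API price you are comparing against matters so much. In the sketch below, the $2,000 monthly fixed cost and the $10/M blended frontier price are hypothetical placeholders; substitute your own fully loaded figures.

```python
def break_even_tokens_per_month(fixed_monthly_cost: float,
                                api_price_per_m: float,
                                self_hosted_marginal_per_m: float = 0.0) -> float:
    """Monthly token volume at which self-hosting matches the API bill.

    Returns infinity when the API is already cheaper per token than the
    marginal cost of serving a token yourself.
    """
    saving_per_m = api_price_per_m - self_hosted_marginal_per_m
    if saving_per_m <= 0:
        return float("inf")
    return fixed_monthly_cost / saving_per_m * 1_000_000


FIXED_MONTHLY_COST = 2_000.0  # hypothetical placeholder, dollars per month

frontier = break_even_tokens_per_month(FIXED_MONTHLY_COST, api_price_per_m=10.00)
budget_api = break_even_tokens_per_month(FIXED_MONTHLY_COST, api_price_per_m=0.14)

print(f"vs $10.00/M API: {frontier / 1e6:,.0f}M tokens/month")
print(f"vs $0.14/M API:  {budget_api / 1e6:,.0f}M tokens/month")
```

With the same fixed cost, the break-even volume against a $0.14/M API is about 70 times higher than against a $10/M one, which is the arithmetic behind "rarely justifies itself".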

[IMAGE: Decision tree for choosing between API, self-hosted, and hybrid inference strategies, with branch points at monthly token volume, data residency requirements, and quality thresholds]

How It Is Used in Practice

ProjectDiscovery cut LLM costs by 59 percent (rising to 70 percent) through prompt caching architecture alone. The key insight: moving dynamic content from the system prompt to the message tail eliminated cache invalidation, improving cache hit rates from 7 percent to 84 percent across 9.8 billion tokens (ProjectDiscovery, 2025).
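
The layout change is easy to demonstrate. Provider prompt caches generally match on an exact shared prefix, so anything that varies per request invalidates everything after it. The sketch below is not ProjectDiscovery's code: the prompt text, timestamps, and function names are hypothetical, and it measures shared characters where a real cache would compare tokens.

```python
import os

# Hypothetical stand-in for a long, unchanging system prompt.
STATIC_INSTRUCTIONS = "You are a security scanning assistant. " * 50


def dynamic_first(timestamp: str, question: str) -> str:
    """Cache-hostile layout: per-request data sits before the static text."""
    return f"Current time: {timestamp}\n{STATIC_INSTRUCTIONS}\n{question}"


def static_first(timestamp: str, question: str) -> str:
    """Cache-friendly layout: static text first, dynamic content at the tail."""
    return f"{STATIC_INSTRUCTIONS}\nCurrent time: {timestamp}\n{question}"


def shared_prefix_chars(first: str, second: str) -> int:
    """Length of the common prefix, a rough proxy for the cacheable part."""
    return len(os.path.commonprefix([first, second]))


request_a = ("2025-01-01T10:00:00", "Scan host A")
request_b = ("2025-01-01T10:05:00", "Scan host B")

hostile = shared_prefix_chars(dynamic_first(*request_a), dynamic_first(*request_b))
friendly = shared_prefix_chars(static_first(*request_a), static_first(*request_b))

print(f"dynamic content first: {hostile} shared characters")
print(f"static content first:  {friendly} shared characters")
```

Moving a single timestamp line from the top of the prompt to the bottom turns almost the entire prompt into a reusable prefix.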

Checkr replaced GPT-4 with a fine-tuned Llama-3-8B for background check classification, achieving 5x cost reduction with 90 percent accuracy and 30x faster inference (LeanLM, 2025). The pattern is consistent: for well-defined tasks with sufficient training data, a small fine-tuned model outperforms a general-purpose giant on both cost and latency.

Intercom shifted to outcome-based pricing for its Fin AI agent, charging $0.99 per resolved customer support conversation rather than per token (Intercom pricing). Zendesk adopted a similar model at $1.50 per automated resolution. This pricing structure aligns incentives: the vendor bears the inference cost risk, and the customer pays only for value delivered. At 100,000 monthly resolutions, the $0.51 gap between those two list prices is a $51,000/month difference.

The AI FinOps tooling ecosystem has matured rapidly. Helicone provides zero-code-change cost monitoring through an API proxy. Langfuse (acquired by ClickHouse in January 2026) offers open-source tracing. LiteLLM acts as an open-source gateway for multi-provider routing. The FinOps Foundation has a dedicated working group for AI cost management. But the tooling is still ahead of the practice: most organisations have cost visibility without cost enforcement.

[IMAGE: Dashboard mockup showing an AI FinOps view with per-model cost breakdown, cost-per-conversation metrics, cache hit rates, and model routing distribution]

Insights Worth Remembering

Output tokens are the expensive ones. The typical 2-5x pricing asymmetry between input and output means that reducing generation length (structured outputs, constrained decoding) often saves more than reducing prompt length.
Prompt caching is the single highest-leverage optimisation. A 90% input cost reduction on cache hits, achievable with architectural changes alone (as ProjectDiscovery demonstrated), often exceeds the savings from switching models entirely.
The model you need is smaller than you think. Google Research showed a 770M-parameter T5 model outperforming a 540B-parameter PaLM on specific tasks through distillation. The question is not "which model is best?" but "which model is best for this task?"
Agentic workflows broke the cost models. Token-per-request pricing assumed a request-response pattern. Agents that issue 50-200 calls per task, each resending the full history, expose a fundamental inefficiency in the stateless API design.
Cheaper inference does not mean lower bills. The Jevons Paradox is the defining dynamic of AI economics. Each cost reduction unlocks new use cases that collectively consume more than what was saved.
The environmental cost is real but widely misquoted. The "10x more energy than Google Search" claim was debunked; the real figure is closer to 1.4x. But at billions of queries per day, even 1.4x matters, and long-context queries with reasoning can consume 100x a simple search.
Self-hosting is rarely the answer for small teams. Engineering overhead alone ($270-550K/year in staffing) makes API-based inference cheaper until volumes exceed roughly 100M tokens per month against frontier pricing.
Outcome-based pricing is the canary in the coal mine. When vendors start charging per resolution instead of per token, it means they have solved (or at least accepted) inference cost as their problem, not yours. Watch for this shift accelerating.

Open Questions

Will inference efficiency gains keep pace with demand growth? Algorithmic efficiency is improving roughly 3x per year (Gundlach et al., 2025), and hardware roughly 30 percent annually. But demand for compute is rising 4-5x per year through 2030 (Deloitte, 2025). If demand outpaces efficiency, prices could plateau or even increase for frontier workloads despite continued improvement at the budget tier.

Can the environmental footprint be managed at scale? Data centre electricity is projected to hit 945 TWh by 2030. Whether smaller, more efficient models (MoE architectures that activate only 37B of 671B parameters, like DeepSeek V3) can offset the growth in total inference volume is an open empirical question. Water consumption for cooling is a growing concern in water-stressed regions; some facilities are shifting to air cooling and heat recapture, but adoption lags.

Will outcome-based pricing replace token-based pricing? The early movers (Intercom, Zendesk, Sierra AI) have shown that per-resolution pricing works for customer support. Whether this model extends to code generation, content creation, and analytical workloads, where "resolution" is harder to define, remains to be seen. Hybrid pricing models (base subscription plus per-outcome overage) rose from 27 to 41 percent market adoption in 12 months.

How will agentic cost control evolve? Current agents have no built-in sense of cost. Token budgets, circuit breakers, and cost-aware planning are emerging in research but are not yet standard in production frameworks. The gap between an agent that solves a problem and an agent that solves it economically is the next frontier of agent design.

What happens when everyone runs agents? If the median enterprise query shifts from single-turn chat (hundreds of tokens) to multi-step agentic workflows (hundreds of thousands of tokens), total inference demand could grow by orders of magnitude beyond current projections, even as per-token prices continue to fall.

Sources and Further Reading