Token Accounting, Billing, and Quotas

The provider invoice arrives. It is 4x what your cost model predicted. You spend two days reconstructing which feature ate the budget, only to discover that one user's automated workflow ran in a loop for six hours. The fix is not "be more careful." The fix is to count tokens at the granularity you need to make decisions, and to enforce limits before the bill arrives, not after.

The four token types you are billed for separately

Modern providers do not have one token price. They have four:

Token type	Typical price (relative)	When it accrues
Input (uncached)	1.0x	Standard input
Cache write	1.25x to 2.0x	First time you mark a prefix as cacheable (Anthropic)
Cache read	0.1x to 0.5x	Subsequent reuse of cached prefix
Output	3.0x to 5.0x	Generated tokens

Two implications people get wrong:

Output is the expensive bit. Anthropic Sonnet 4.5 output tokens cost roughly 5x input tokens. A chatty model that returns 2,000 tokens when 200 would do costs you 10x more on the output line. Constrain output length aggressively (max_tokens, structured outputs, "respond in under 100 words").
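Cache write is a one-time tax, cache read is the prize. If you write a cache and never read it back, you paid a 25% to 100% premium for nothing. How many reads it takes to earn out depends on the multipliers: at 1.25x write and 0.1x read, the first reuse already comes out ahead (1.35x versus 2.0x uncached); at 2.0x write and 0.5x read, it takes three reads within the TTL (3.5x versus 4.0x).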
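
The break-even point for caching can be worked out from the relative multipliers in the table above. The sketch below is illustrative only: the function names are made up for this example, and the multipliers are the table's ranges rather than any provider's exact prices.

```python
from fractions import Fraction


def cost_with_cache(reads: int, write_mult: Fraction, read_mult: Fraction) -> Fraction:
    """One cache write plus `reads` cache reads, in units of one uncached prefix."""
    return write_mult + reads * read_mult


def cost_without_cache(reads: int) -> Fraction:
    """The same prefix sent uncached 1 + `reads` times."""
    return Fraction(1 + reads)


def break_even_reads(write_mult: Fraction, read_mult: Fraction) -> int:
    """Smallest number of cache reads at which caching is strictly cheaper."""
    if read_mult >= 1:
        raise ValueError("cache reads must be cheaper than uncached input")
    reads = 0
    while cost_with_cache(reads, write_mult, read_mult) >= cost_without_cache(reads):
        reads += 1
    return reads


# Cheapest and most expensive ends of the ranges in the table.
cheap_end = break_even_reads(Fraction("1.25"), Fraction("0.1"))
expensive_end = break_even_reads(Fraction("2.0"), Fraction("0.5"))
```

With these inputs the cheap end of the range pays off on the first reuse and the expensive end needs three, so it is worth running the numbers for your own price table before deciding what to cache.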

For accounting, log all four counters separately. Aggregating them into a single "tokens" number is the kind of false economy that costs you a week of debugging in month three.

Per-feature, per-user, per-org attribution

The minimum dimensions to attribute every LLM call:

from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal

@dataclass
class LLMUsageRecord:
    request_id: str
    timestamp_utc: datetime
    org_id: str           # billing entity
    user_id: str          # individual quota subject
    feature: str          # "chat", "summarise_doc", "agent_step_3"
    model: str            # "claude-sonnet-4-5"
    input_tokens: int     # uncached input
    cache_read_tokens: int
    cache_write_tokens: int
    output_tokens: int
    cost_usd: Decimal     # computed from price table at call time
    latency_ms: int
    cached_prefix_hash: str | None   # None when the call used no cached prefix
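
The cost_usd field can be filled in from a price table you maintain yourself. A minimal sketch, assuming prices are quoted in USD per million tokens; the model name, the PRICES_PER_MTOK table, and the numbers in it are placeholders that follow the relative multipliers in the table above, not a provider's published rates.

```python
from decimal import Decimal

# Hypothetical price table, USD per million tokens. Placeholder values only.
PRICES_PER_MTOK = {
    "example-model": {
        "input": Decimal("3.00"),
        "cache_write": Decimal("3.75"),
        "cache_read": Decimal("0.30"),
        "output": Decimal("15.00"),
    },
}

MILLION = Decimal(1_000_000)


def compute_cost_usd(
    model: str,
    input_tokens: int,
    cache_read_tokens: int,
    cache_write_tokens: int,
    output_tokens: int,
) -> Decimal:
    """Price each of the four counters separately, then sum."""
    prices = PRICES_PER_MTOK[model]
    total = (
        input_tokens * prices["input"]
        + cache_write_tokens * prices["cache_write"]
        + cache_read_tokens * prices["cache_read"]
        + output_tokens * prices["output"]
    )
    return total / MILLION


example_cost = compute_cost_usd(
    "example-model",
    input_tokens=1000,
    cache_read_tokens=6000,
    cache_write_tokens=0,
    output_tokens=200,
)
```

Decimal is used rather than float so that millions of tiny per-request amounts sum without rounding drift.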

Store these in an append-only table partitioned by day. Build all reports off this one table:

"Which feature spent the most last week?" -> GROUP BY feature
"Which user is approaching their daily limit?" -> WHERE timestamp_utc >= today_utc GROUP BY user_id
"Cache hit rate per feature?" -> SUM(cache_read_tokens) / SUM(input_tokens + cache_read_tokens) GROUP BY feature
"Did the new prompt template increase cost per request?" -> compare pre/post deploy

The trick is computing cost_usd at call time using a price table you control, not asking the provider later. Provider invoices arrive monthly; you need cost-attributable telemetry within seconds.
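
The report queries listed above can be sketched against an in-memory SQLite table. Everything here is illustrative: the table name llm_usage, the cost_microusd integer column (SQLite has no decimal type, so this sketch stores micro-dollars), and the three sample rows are invented for the example.

```python
import sqlite3


def build_demo_table() -> sqlite3.Connection:
    """Create an in-memory usage table with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE llm_usage (
               feature TEXT, user_id TEXT, timestamp_utc TEXT,
               input_tokens INTEGER, cache_read_tokens INTEGER,
               cache_write_tokens INTEGER, output_tokens INTEGER,
               cost_microusd INTEGER)"""
    )
    conn.executemany(
        "INSERT INTO llm_usage VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            ("chat", "u1", "2025-01-01T09:00:00Z", 1000, 0, 4000, 200, 21000),
            ("chat", "u1", "2025-01-02T09:00:00Z", 1000, 6000, 0, 200, 7800),
            ("summarise_doc", "u2", "2025-01-02T10:00:00Z", 8000, 0, 0, 500, 31500),
        ],
    )
    return conn


def spend_by_feature(conn: sqlite3.Connection) -> list:
    """Which feature spent the most? Highest spend first."""
    return conn.execute(
        "SELECT feature, SUM(cost_microusd) FROM llm_usage "
        "GROUP BY feature ORDER BY SUM(cost_microusd) DESC"
    ).fetchall()


def tokens_by_user_since(conn: sqlite3.Connection, start_utc: str) -> dict:
    """Total tokens per user from start_utc onwards (ISO-8601 strings sort correctly)."""
    return dict(conn.execute(
        "SELECT user_id, SUM(input_tokens + cache_read_tokens "
        "+ cache_write_tokens + output_tokens) "
        "FROM llm_usage WHERE timestamp_utc >= ? GROUP BY user_id",
        (start_utc,),
    ).fetchall())


def cache_hit_rate_by_feature(conn: sqlite3.Connection) -> dict:
    """Cache reads as a share of uncached input plus cache reads, per feature."""
    return dict(conn.execute(
        "SELECT feature, 1.0 * SUM(cache_read_tokens) "
        "/ SUM(input_tokens + cache_read_tokens) "
        "FROM llm_usage GROUP BY feature"
    ).fetchall())
```

Note that the filter goes in WHERE before GROUP BY, and that the ratios are computed over sums, not averaged per row; averaging per-request hit rates would weight a 50-token request the same as a 50,000-token one.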

Soft vs hard quotas

Quota type	Behaviour on hit	Use for
Soft	Log, alert, optionally degrade (smaller model, shorter output)	"You are approaching your monthly limit"
Hard	Block the request, return 429 with structured error	"You have exceeded your free-tier daily chat limit"

A working pattern for a paid SaaS:

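from datetime import datetime, time, timedelta, timezone

def enforce_quota(user, requested_tokens_estimate):
    # redis, PLAN_LIMITS, QuotaExceeded and notify_user_async are defined elsewhere
    now = datetime.now(timezone.utc)
    utc_today = now.date().isoformat()
    # Redis returns bytes (or None for a missing key), so cast before adding
    daily_used = int(redis.get(f"tokens:{user.id}:{utc_today}") or 0)
    daily_limit = PLAN_LIMITS[user.plan]    # 50_000 free, 500_000 pro, 2_000_000 pro_max

    if daily_used + requested_tokens_estimate > daily_limit:
        # Next 00:00 UTC, so the client knows when to retry
        utc_tomorrow_midnight = datetime.combine(
            now.date() + timedelta(days=1), time.min, tzinfo=timezone.utc
        )
        raise QuotaExceeded(
            limit=daily_limit, used=daily_used, reset_at=utc_tomorrow_midnight
        )

    # Soft warning at 80%
    if daily_used > 0.8 * daily_limit:
        notify_user_async(user, "approaching daily limit")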

For chat-like products, count by request (cheap, predictable), not by token (expensive to estimate up front). For document-processing products, count by token because the variance is real.

The noisy neighbour problem

If your service uses one upstream API key shared across all users, one user's runaway loop consumes the provider rate limit and every other user sees 429s. This is the classic noisy-neighbour problem and it shows up the first day you go to production.

Mitigations, in increasing order of cost:

Concurrency cap per user. Limit each user to N in-flight LLM requests. A semaphore in Redis, three lines of code. Bounds how much of the shared rate limit a runaway loop can hold at any moment.
Token-bucket per user. Smooths out bursts: allows short bursts above the average rate but prevents sustained over-consumption.
Separate upstream keys per high-value tenant. Enterprise customers get a dedicated provider key so their rate limit is theirs alone. Operationally heavier; reserved for customers who pay for it.
Per-tenant inference cluster. The nuclear option - a dedicated vLLM deployment per tenant. Only justifiable for regulated workloads or very large customers.

The pattern that scales: shared key + per-user concurrency cap + per-tenant token-bucket. Dedicated keys only when a customer specifically pays for isolation.
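
The two cheapest mitigations can be sketched in a few lines. This is an in-process, single-threaded illustration with made-up class names; a real deployment would keep this state in a shared store such as Redis so that every application instance sees the same counters.

```python
class ConcurrencyCap:
    """Allow at most max_in_flight concurrent requests per user."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = {}  # user_id -> number of requests currently running

    def try_acquire(self, user_id: str) -> bool:
        current = self.in_flight.get(user_id, 0)
        if current >= self.max_in_flight:
            return False
        self.in_flight[user_id] = current + 1
        return True

    def release(self, user_id: str) -> None:
        self.in_flight[user_id] = max(0, self.in_flight.get(user_id, 0) - 1)


class TokenBucket:
    """Permit bursts up to `capacity`, refilled at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full so the first burst is allowed
        self.updated_at = now

    def try_consume(self, amount: float, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = max(self.updated_at, now)
        if amount > self.tokens:
            return False
        self.tokens -= amount
        return True
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the bucket deterministic and easy to test; in production you would pass time.monotonic().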

Daily-reset patterns - UTC vs user-local

This sounds trivial. It is not.

UTC midnight reset. Simple to implement, one global counter rollover. A user in California sees their quota reset at 4 PM local time (5 PM during daylight saving time). Some users find this surprising; nobody who has run the system finds it ambiguous.
User-local midnight reset. Resets at midnight in each user's stated timezone. Friendlier UX, harder to implement (you now have 24+ rollover events per day, and DST shifts make some days have 23 or 25 hours).
Rolling 24-hour window. "You can use 50,000 tokens in any 24-hour period." Smoothest UX, but it requires a sorted set in Redis with timestamped entries and a sliding-window query.
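
The rolling-window option can be sketched in memory with a deque of timestamped entries. The class name and method are invented for this example; in Redis the same idea would roughly correspond to a sorted set scored by timestamp, with old entries trimmed before each check.

```python
from collections import deque

DAY_SECONDS = 24 * 60 * 60


class RollingWindowQuota:
    """Allow at most limit_tokens within any trailing window_seconds."""

    def __init__(self, limit_tokens: int, window_seconds: int = DAY_SECONDS):
        self.limit_tokens = limit_tokens
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, tokens), oldest first
        self.used = 0          # running total of tokens still inside the window

    def _evict_expired(self, now: float) -> None:
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_record(self, tokens: int, now: float) -> bool:
        """Record usage if it fits in the window; return False if it would not."""
        self._evict_expired(now)
        if self.used + tokens > self.limit_tokens:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True
```

The cost of the smoother UX is visible here: state grows with the number of requests in the window, whereas a fixed daily reset needs one integer per user.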

Pick UTC unless you have a specific reason not to. Document it in the API. The number of support tickets averted by saying "quotas reset at 00:00 UTC" in the docs is non-trivial.
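
Whichever reset policy you choose, the reset instant is worth computing once and returning in quota errors. A small sketch, assuming timezone-aware datetimes throughout; next_reset_utc is a hypothetical helper name, not part of any library.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def next_reset_utc(now_utc: datetime, tz: tzinfo = timezone.utc) -> datetime:
    """Return the next midnight in `tz`, expressed as a UTC datetime.

    With the default tz this is the plain UTC-midnight policy; passing a
    user's timezone gives the user-local policy.
    """
    local_now = now_utc.astimezone(tz)
    next_local_midnight = datetime.combine(
        local_now.date() + timedelta(days=1), time.min, tzinfo=tz
    )
    return next_local_midnight.astimezone(timezone.utc)


now = datetime(2025, 1, 15, 18, 30, tzinfo=timezone.utc)
utc_reset = next_reset_utc(now)

# Fixed UTC-8 offset used here to keep the example self-contained.
utc_minus_8 = timezone(timedelta(hours=-8))
local_reset = next_reset_utc(now, utc_minus_8)
```

For real user timezones you would pass a zoneinfo.ZoneInfo object (for example ZoneInfo("America/Los_Angeles")) instead of a fixed offset, so that daylight saving transitions are handled by the timezone database.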

When it falls down

You forgot streaming tokens. Streaming responses still cost output tokens. Count them after the stream completes, not at request initiation. Otherwise your usage numbers are systematically low.
You estimate tokens wrong. Provider tokenisers (tiktoken for OpenAI models, Anthropic's token-counting endpoint) are accurate; naive character-count heuristics are not. Use the provider's tokeniser or accept 10-20% error in your pre-flight estimates.
You do not separate cached input from uncached input. A workload that is 90% cache hits looks expensive in a dashboard that lumps cache reads into input_tokens. You will mis-prioritise optimisations.
Quota check and LLM call are not atomic. Two parallel requests both pass the quota check and then both consume tokens. Either atomically reserve the estimated tokens before the call (Redis INCRBY, then correct the counter once actual usage is known), or accept some over-spend slack.
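
The reserve-then-settle approach to the last failure mode can be sketched as follows. InMemoryCounterStore is a stand-in written for this example so the code runs without a server; its incrby method mimics the atomic increment-and-return behaviour of Redis INCRBY. The function names and the QuotaExceeded class are likewise assumptions.

```python
import threading


class InMemoryCounterStore:
    """Stand-in for Redis: incrby atomically adds and returns the new total."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def incrby(self, key: str, amount: int) -> int:
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount
            return self._values[key]


class QuotaExceeded(Exception):
    pass


def reserve_tokens(store, key: str, estimate: int, daily_limit: int) -> int:
    """Claim the estimate first, then check; roll back if it overshoots."""
    new_total = store.incrby(key, estimate)
    if new_total > daily_limit:
        store.incrby(key, -estimate)  # give the reservation back
        raise QuotaExceeded(f"limit {daily_limit} would be exceeded")
    return estimate


def settle_tokens(store, key: str, reserved: int, actual: int) -> int:
    """After the call (or stream) completes, replace the estimate with real usage."""
    return store.incrby(key, actual - reserved)
```

Because the increment happens before the check, two parallel requests can never both slip under the limit; the worst case is that one of them is rejected slightly early while the other's reservation is still an overestimate. Settling after the response finishes also covers the streaming case, since output tokens are only known once the stream ends.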