Compression without degradation · Works with Anthropic, OpenAI & Cursor

Cut your LLM costs
by 40–70%.
Change one URL.

TokenMinMax compresses every LLM request automatically — 95% of context retained, zero quality loss. Short conversations save 40%. Long sessions save up to 70%. Works with Anthropic, OpenAI, Claude Code, and Cursor. No SDK changes required.

Start 7-day free trial → See pricing
40–70%
Token savings · 95% context retained
<100ms
Added latency
4 tiers
4-tier compression pipeline
1 line
Integration change

Four-tier compression pipeline

Every request passes through a sequential pipeline before reaching your LLM provider. Always fails open — if any tier fails, the request proceeds unaffected.

Tier 1
Free
Rule-based cleanup
Strips filler phrases, repeated acknowledgments, and redundant whitespace. Runs synchronously on every request — no external calls, no added latency.
~5%
~5ms · free
Tier 2
Free
Entropy scoring
Scores each message by information density. Low-entropy messages (greetings, one-word acks) are dropped or shortened. Code blocks, numbers, and decisions are always protected.
~5%
~5ms · free
Tier 3
included
ML token pruning
ML token pruning on contexts above 4,000 tokens. Code-heavy messages are automatically detected and skipped — code is never sent to the pruning model.
~10%
~100ms
Tier 4
included
Session summarisation
Runs async after the response — zero latency impact. Builds a rolling summary of older conversation turns and injects it as a read-only prefix. Tier 1/Tier 2/Tier 3 only ever see new messages, never the summary. Best for long multi-turn chats, agentic workflows, Claude Code and Cursor sessions where history compounds with every request.
30–50%
+0ms · async after response

Everything you need, nothing you don't

Full streaming support
Compression happens before the request. Streaming passthrough is unmodified — no buffering, no UX impact. Accurate token stats appended as a trailing SSE event.
🛡
Content protection
Code, numbers, legal text, and other high-value content are detected and skipped by Tier 3 — never sent to the pruning model. A safety mechanism ensures protected content is always preserved in the Tier 4 session summary even if it ages out of the active context.
📊
Savings dashboard
Real-time per-request stats. Input token savings, dollar savings by model, and per-tier breakdown — Tier 1/Tier 2, Tier 3, and Tier 4 tracked separately.
🔒
Privacy Mode
One header disables all external compression APIs. Data flows only through the proxy to your LLM provider — no third-party processors.
🖱
Cursor + Claude Code
Set your base URL and every request is automatically compressed. Works with Cursor settings, ANTHROPIC_BASE_URL, and key-in-URL formats — no plugins, no code changes.
🔁
Auto session tracking
No session header needed. Session IDs are derived from your API key and first message — same conversation always maps to the same session automatically.

One URL change. That's it.

No SDK changes, no code refactors. Point your client at TokenMinMax and every request is automatically compressed.

Anthropic — Before
# Your existing code
client = Anthropic()
Anthropic — After
client = Anthropic(
  base_url="https://www.tokenminmax.com/anthropic",
  default_headers={
    "x-tokenminmax-key": "tmm-yourkey"
  }
)
OpenAI / Cursor — Before
# Your existing code
client = OpenAI()
OpenAI / Cursor — After
client = OpenAI(
  base_url="https://www.tokenminmax.com/openai",
  default_headers={
    "x-tokenminmax-key": "tmm-yourkey"
  }
)

Simple, transparent pricing

Tier 1+2 free on all plans. Tier 3 included from Starter. Tier 4 session memory from Growth. Extra credits available on Pro.