Compression without degradation · Works with Anthropic, OpenAI & Cursor

Cut your LLM costs
by 40–70%.
Change one URL.

TokenMinMax compresses every LLM request automatically — 95% of context retained, zero quality loss. Short conversations save 40%. Long sessions save up to 70%. Works with Anthropic, OpenAI, Claude Code, and Cursor. No SDK changes required.

Start 7-day free trial → See pricing

How it works

Four-tier compression pipeline

Every request passes through a sequential pipeline before reaching your LLM provider. Always fails open — if any tier fails, the request proceeds unaffected.

Tier 1

Free

Rule-based cleanup

Strips filler phrases, repeated acknowledgments, and redundant whitespace. Runs synchronously on every request — no external calls, no added latency.

~5%

~5ms · free

Tier 2

Free

Entropy scoring

Scores each message by information density. Low-entropy messages (greetings, one-word acks) are dropped or shortened. Code blocks, numbers, and decisions are always protected.

~5%

~5ms · free

Tier 3

included

ML token pruning

ML token pruning on contexts above 4,000 tokens. Code-heavy messages are automatically detected and skipped — code is never sent to the pruning model.

~10%

~100ms

Tier 4

included

Session summarisation

Runs async after the response — zero latency impact. Builds a rolling summary of older conversation turns and injects it as a read-only prefix. Tier 1/Tier 2/Tier 3 only ever see new messages, never the summary. Best for long multi-turn chats, agentic workflows, Claude Code and Cursor sessions where history compounds with every request.

30–50%

+0ms · async after response

Features

Everything you need, nothing you don't

⚡

Full streaming support

Compression happens before the request. Streaming passthrough is unmodified — no buffering, no UX impact. Accurate token stats appended as a trailing SSE event.

🛡

Content protection

Code, numbers, legal text, and other high-value content are detected and skipped by Tier 3 — never sent to the pruning model. A safety mechanism ensures protected content is always preserved in the Tier 4 session summary even if it ages out of the active context.

📊

Savings dashboard

Real-time per-request stats. Input token savings, dollar savings by model, and per-tier breakdown — Tier 1/Tier 2, Tier 3, and Tier 4 tracked separately.

🔒

Privacy Mode

One header disables all external compression APIs. Data flows only through the proxy to your LLM provider — no third-party processors.

🖱

Cursor + Claude Code

Set your base URL and every request is automatically compressed. Works with Cursor settings, ANTHROPIC_BASE_URL, and key-in-URL formats — no plugins, no code changes.

🔁

Auto session tracking

No session header needed. Session IDs are derived from your API key and first message — same conversation always maps to the same session automatically.

Integration

One URL change. That's it.

No SDK changes, no code refactors. Point your client at TokenMinMax and every request is automatically compressed.

Anthropic — Before

# Your existing code
client = Anthropic()

Anthropic — After

client = Anthropic(
  base_url="https://www.tokenminmax.com/anthropic",
  default_headers={
    "x-tokenminmax-key": "tmm-yourkey"
  }
)

OpenAI / Cursor — Before

# Your existing code
client = OpenAI()

OpenAI / Cursor — After

client = OpenAI(
  base_url="https://www.tokenminmax.com/openai",
  default_headers={
    "x-tokenminmax-key": "tmm-yourkey"
  }
)

Cut your LLM costsby 40–70%.Change one URL.

Four-tier compression pipeline

Everything you need, nothing you don't

One URL change. That's it.

Simple, transparent pricing

Cut your LLM costs
by 40–70%.
Change one URL.