The punchline
Your session cost compounds exponentially, not linearly. In community analysis of a 100+ message chat, 98.5% of all tokens were spent re-reading old conversation, not generating new work. The fix isn't bigger models — it's knowing which of the five things to do after every Claude response, and doing it on time.
1. What is a context window?
A context window is everything Claude can see at one moment — Claude's current working memory. It includes:
- The system prompt
- Your full conversation so far (every turn, both sides)
- Every tool call Claude has issued
- Every tool output that came back
- Every file Claude has read
- Every skill, MCP server, or agent loaded into the project
- CLAUDE.md and any project-level context files
Context window sizes differ by surface. As of April 2026:
| Surface | Window | Notes |
|---|---|---|
| Claude.ai (Pro / Max) | 200K | Per conversation |
| Claude Code | 1M (Opus) | Per session |
| API — Sonnet | 200K | Per request |
| API — Opus | 1M | Per request |
The baseline overhead nobody talks about
A fresh Claude Code session isn't at zero tokens. Before you type anything, startup overhead commonly burns 8,000+ tokens from system prompts, CLAUDE.md, context files, MCP tool schemas, and loaded skills.
In heavier setups that baseline is much larger — one fresh session was observed at 62,000 tokens before the first user prompt. That's 6.2% of a 1M window spent before the work begins.
Do this now: open a fresh Claude Code session and run /context. If you're already above 10,000 tokens on a cold start, audit what's in your CLAUDE.md, skills, and MCP loadout.
2. How tokens actually work
A token is the smallest unit of text Claude reads and bills for. Roughly one token ≈ one word, but not exactly — punctuation, spaces, and word pieces count separately. “Unhappiness” is usually 3 tokens; “the cat sat” is usually 4.
The compounding cost most people miss
Every time you send a message, Claude rereads the entire conversation from the beginning plus the new input, then generates a response. Those re-read tokens are billable input tokens.
This means your session cost doesn't add — it compounds. Rough numbers from a typical coding chat:
| Message | Input tokens | Cumulative |
|---|---|---|
| #1 | 500 | 500 |
| #5 | 2,500 | ~8K |
| #15 | 8,000 | ~70K |
| #30 | 15,500 | ~240K |
Message 30 costs 31× more per turn than message 1. A community analysis of a 100+ message session found 98.5% of total token spend was re-reading the prior conversation — not new input, not output, just rehashing what had already been said.
The implication is unintuitive: the cheapest thing you can do is start a new session sooner, not try to be terser in the existing one.
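The compounding arithmetic can be sketched in a few lines. The per-turn token counts here (≈500 for a user message, ≈1,500 for a reply plus tool output) are illustrative assumptions, not measured values:

```python
# Illustrative sketch: why per-turn input cost compounds.
# Per-turn sizes are assumptions, not measured values.

def session_cost(turns, user_tokens=500, reply_tokens=1500):
    """Total billable input tokens when every turn re-reads the full history."""
    history = 0
    total_input = 0
    for _ in range(turns):
        history += user_tokens   # your new message joins the history
        total_input += history   # Claude re-reads everything so far
        history += reply_tokens  # the reply joins the history too
    return total_input

linear = 30 * 500               # what intuition expects: 15K tokens
compounding = session_cost(30)  # what you actually pay for input
print(linear, compounding)
```

With these assumptions, 30 turns cost 885K input tokens against the 15K a linear model predicts. The exact numbers don't matter; the quadratic shape does.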
3. Input vs output tokens
Output tokens cost more per token than input tokens — across every current Claude model, output is priced 4-5× higher.
Instinct says: “tell Claude to be concise, cut output, save money.” In practice this doesn't move the needle. Output volume in a typical coding session is dwarfed by tool outputs — file reads, grep results, test runs, build logs — which all end up on the input side on the next turn.
“Be concise” plugins and prompts have been measured in community A/B tests — real savings are usually in the 5-10% range, not the 50-70% people expect. The thing that actually moves token spend is managing what enters your context in the first place.
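A back-of-envelope check makes the point concrete. The input/output split and the per-million-token prices below are illustrative placeholders (the 1:5 price ratio matches the 4-5× range above):

```python
# Back-of-envelope: why halving output barely moves total cost.
# Prices and token splits are illustrative assumptions.

def total_cost(input_tokens, output_tokens,
               input_price=3.0, output_price=15.0):
    """Cost in dollars, with prices per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# A session dominated by re-read input: 900K input, 30K output.
before = total_cost(900_000, 30_000)
after = total_cost(900_000, 15_000)  # "be concise" halves the output
savings = 1 - after / before
print(f"{savings:.1%}")
```

Halving output trims roughly 7% here, squarely in the 5-10% range community tests report.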
4. Context rot (AI dementia)
Context rot is what happens when a session runs long: model performance measurably degrades as Claude's attention spreads across every token in the window. You've seen the symptoms:
- Contradicting itself across turns
- Editing files it didn't bother to re-read
- Forgetting decisions from earlier in the session
- Getting vague and noticeably sloppier
The numbers from Anthropic's own retrieval benchmarks are stark:
| Context fill | Retrieval accuracy |
|---|---|
| 256K tokens | 92% |
| 1M tokens | 78% |
A 14-point accuracy drop means Claude is measurably worse at finding what it needs in a nearly-full window. The session can “technically” hold a million tokens and still be the wrong tool for the job at that size.
Independent community analysis of 18,000 thinking blocks across 7,000 Claude Code sessions found that thinking depth drops 67% as sessions get longer, and edit-without-reading rates rose from 6% in short sessions to 34% in long ones. Longer sessions produce sloppier work — your token efficiency goes down because you need more turns to hit the same output quality you would have gotten at minute five.
Practical rule: if Claude gets something wrong twice in a row, don't push through — the session is already degraded. Rewind, or restart with a fresh context. You'll burn fewer tokens across the total arc.
5. Usage limits across surfaces
Usage limits work differently on every Claude surface. Here's the rough shape of each:
Claude.ai (Pro, Max, Team, Enterprise)
Usage is measured as a rolling 5-hour allowance of messages, with plan tier determining how many messages that allowance contains. Long conversations count more heavily because each message re-reads the entire thread — the same 10-message chat consumes more quota than 10 fresh 1-message conversations.
Claude Code
Claude Code meters on both a 5-hour rolling window and a weekly limit, with the exact allowance depending on your plan. The new desktop app surfaces your remaining session-limit bar live — watching that bar is the single cheapest habit to build. One monitor for work, one glance at the limit.
API (Anthropic platform)
The API uses token-per-minute and request-per-minute rate limits that scale with your usage tier. Tiers are unlocked automatically as your spend and history grow — see the Anthropic console for your current tier and ceilings.
What burns limits fastest
- Long conversations. The compounding re-read is the biggest single line item.
- Large file reads dumped into the session. One multi-megabyte log pasted in stays resident for the rest of the session.
- Auto-compaction firing late. When compaction triggers at 95%, you've already paid to re-read a near-full window many times.
- Opus when Sonnet would do. Opus is roughly 5× the per-token cost of Sonnet and 25× the cost of Haiku; match the model to the task, not the task to the model.
6. The five choices after every Claude response
Anthropic frames session management around a simple insight: after each Claude response, you have exactly five choices. Knowing which to pick is the single biggest lever on token spend.
Just reply: Send another message. Natural, easy, and expensive — every turn compounds. Use it when you're close to the end of a focused task.
Double-Esc or /rewind: Jump back to any earlier message and drop everything after it. Anthropic's #1 recommendation. Failed attempts that stay in context keep polluting future responses.
/clear: Start a completely fresh session. Use when you're switching tasks entirely, or when the current session has visibly rotted.
/compact: Summarize the session and replace the full history with the summary. Useful mid-task — but do it manually around 50-60% fill, not when auto fires.
Task tool: Delegate a bounded task to a fresh context window. The sub-agent returns only the result, keeping your main session clean. Often the cheapest move.
Why rewind beats “try again”
The most common way people waste tokens: Claude produces broken code, you say “that didn't work, try this,” and Claude tries something else. It often works — but the failed attempt is still in context and gets re-read on every subsequent turn. /rewind lets you jump back and retry with a clean slate.
When you use the Rewind menu in Claude Code, there's also a “Summarize from here” option that creates a handoff message — a note from Claude's future-self to its past-self saying “here's what we figured out, skip the dead ends.” Use it when you want the learning but not the baggage.
7. Manual vs auto compaction
Auto-compaction kicks in at roughly 95% of your context window. The community consensus, which aligns with our experience, is that 95% is way too late. Two problems:
- Auto-compact only retains 20-30% of the original detail. You lose decisions, file paths, and nuance that the next phase needs.
- The compaction itself runs at the model's least intelligent point — peak context rot. The summary quality is at its worst.
Analogy: packing for a trip. If you pack the night before, you make a list, check it, and remember everything. If you throw clothes in the bag five minutes before leaving, you forget your charger, your toothbrush, the adapter. Auto-compact at 95% is the five-minutes-before pack.
The manual-compact pattern that works
Instead of waiting for auto, use this three-step pattern at around 50-60% of your window:
1. Ask Claude: “Give me a full summary of what we've decided, what we've shipped, the files that matter, and exactly where we are on the current task.”
2. Copy the summary. Run /clear.
3. Paste the summary as your first message in the fresh session. Keep going.
Why this beats /compact: you control the summary quality because Claude is still sharp at 60% fill. You also get a chance to review the summary before committing, strip anything irrelevant, and add tracking files (decision log, task list, plan doc) so nothing important lives only in chat history.
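If you want a starting point for the step-1 summary, a hypothetical handoff template looks something like this (the headings are suggestions, not a standard format):

```markdown
## Handoff: [project], [date]

**Decisions made:** …
**Shipped so far:** …
**Files that matter:** …
**Current task and exact position:** …
**Dead ends (don't retry):** …
```

The “dead ends” section is the part /compact most often drops, and the part that saves the most tokens in the next session.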
Heuristic: if you're on 1M-Opus, move to a fresh session at ~120K tokens (12%). Past that point context rot starts to bite. You can always chain a follow-up session with the summary.
8. Sub-agents done right
A sub-agent is a Claude task delegated to its own fresh context window. It does its work, synthesizes results, and returns a bounded output to your main session. Your main session never sees the intermediate steps.
Think of it as briefing a research intern. If you wanted an intern to synthesize 50 articles, you wouldn't sit behind them reading each article alongside. You'd say “find the three most important findings and hand me the summary.” You spend one sentence of headspace; they do the grind.
When to delegate
- Codebase exploration — “map how auth works, summarize the flow, list the three files I need to read”
- Documentation lookup — “find the correct API signature in node_modules, return just the signature”
- Verification — “run the test suite and report only the first failing test”
- Independent review — “check this migration for concurrency issues, return a yes/no + one-paragraph reasoning”
Match the model to the task
Sub-agents don't have to run on your main session's model. A “summarize these 15 files” agent on Haiku costs ~4% of what it costs on Opus and delivers the same quality. Be explicit in the delegation prompt: “Use Haiku. Return a bulleted summary under 200 words.”
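Most of the saving comes from what stays resident in the main session. A rough sketch, with the file sizes and turn counts as assumptions:

```python
# Illustrative: material in context is re-read on every later turn,
# so its cost is its size times the turns remaining. Sizes are assumptions.

def resident_cost(tokens, remaining_turns):
    """Input tokens paid for material that stays in context until session end."""
    return tokens * remaining_turns

in_main = resident_cost(60_000, 20)  # 15 files read directly into the session
delegated = resident_cost(300, 20)   # sub-agent returns a 300-token summary
print(in_main, delegated)
```

Under these assumptions the direct read costs 200× more over the rest of the session, before counting the sub-agent's own (cheap, throwaway) context.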
9. Practical token-saving techniques
Watch your session meter constantly
If you're on the Claude desktop app, keep the session-limit bar visible on a second monitor. Nothing changes habits faster than seeing the number tick down in real time. If you're at 60% and the reset is an hour away, spend freely — spin up agents, run heavy tasks. If you're at 20% with four hours left, be strategic.
Convert everything to markdown before Claude reads it
PDF, DOCX, and HTML carry layout, metadata, and styling noise that the model doesn't need. Stripping to markdown gives you roughly the same content in a fraction of the tokens:
| From | To | Token reduction |
|---|---|---|
| HTML | Markdown | ~90% |
| PDF (text) | Markdown | 65-70% |
| DOCX | Markdown | ~33% |
A 40-page PDF and a 130-page markdown file fit in roughly the same amount of context. Tools like Docling, Marker, or a quick pandoc conversion handle this in seconds. PDF-with-OCR for scanned documents is a separate conversation — there you need vision input and the savings don't apply.
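A minimal standard-library illustration of where the HTML savings come from. Real converters like pandoc, Docling, or Marker preserve structure far better than this bare-bones tag stripper:

```python
# Strip tags, attributes, and styling from HTML, keeping only text content.
# A toy demonstration of the size reduction, not a production converter.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Collect only the visible text between tags.
        if data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

raw_html = '<div class="post" style="margin:0"><h1>Title</h1><p>Hello <b>world</b></p></div>'
parser = TextExtractor()
parser.feed(raw_html)
stripped = parser.text()
print(len(raw_html), len(stripped))  # character counts before and after
```

Even on this tiny snippet the markup is most of the bytes; on real exported pages the ratio is usually far worse.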
Use /btw for side questions
Claude Code's /btw opens a quick overlay for side questions that don't enter your conversation history. Use it when you're deep in a task and need a quick reference — what's the flag for X, which file is config in — without polluting the main context.
Start every session in plan mode
Boris Cherny, creator of Claude Code, has said publicly that he starts every session in plan mode. Getting agreement on the plan before writing a line of code consistently costs fewer total tokens than jumping straight to implementation. Bad plans cause rework. Rework compounds. One good 10-minute plan saves an hour of mid-task pivots.
CLAUDE.md discipline — under 200 lines
CLAUDE.md loads on every session. If it's 600 lines of meandering conventions, you pay for those 600 lines every time. The rule of thumb:
- CLAUDE.md: under 200 lines / ~2K tokens
- Specialized instructions → move to skills or context files loaded on demand
- Large repos → add a .claudeignore so Claude doesn't pull in vendor directories, build artifacts, or lockfiles
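A quick way to check your own file against that budget. The chars/4 token estimate is a crude heuristic, not a real tokenizer:

```python
# Audit sketch: flag a CLAUDE.md that exceeds ~200 lines or ~2K tokens.
# The token estimate (chars / 4) is a rough heuristic only.
from pathlib import Path

def audit_claude_md(path, max_lines=200, max_tokens=2000):
    text = Path(path).read_text(encoding="utf-8")
    lines = text.count("\n") + 1
    est_tokens = len(text) // 4  # rough heuristic: ~4 chars per token
    ok = lines <= max_lines and est_tokens <= max_tokens
    return {"lines": lines, "est_tokens": est_tokens, "ok": ok}
```

Run it against every project's CLAUDE.md once; the over-budget ones are where your cold-start overhead lives.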
Session chaining for big projects
If you're running a large project across multiple days, don't try to do it all in one session. Split into an assembly line where each session has a specialized job:
- Discovery: Claude reads the PDFs, maps the codebase, produces a summary doc. Clear.
- Planning: Fresh session reads the summary doc, produces an implementation plan with milestones. Clear.
- Execution: Fresh session reads the plan, does the implementation milestone-by-milestone. Compact between milestones.
10. Knowing where your tokens go
You can't optimize what you can't see. If your only signal is “session limit hit again,” you have no basis for changing habits. A useful token-observability setup tracks at minimum:
- Tokens per project — which repos burn the most
- Tokens per prompt — which specific requests were the most expensive, so you can reverse-engineer why
- Input vs output split — a project that's 80% input is usually a read-heavy setup that would benefit from more aggressive file filtering
- Cache hit rate — Anthropic's prompt cache gives you a ~90% discount on cached input tokens. Low cache hit rate is a sign your prompts are changing too much session-to-session
- Skill + tool invocation frequency — a skill running 181 times a week is either paying for itself or you're accidentally triggering it
A simple approach: log each message's input / output / cache-read / cache-create token counts into a spreadsheet or a small SQLite database, group by project and model, and review weekly. You don't need a fancy dashboard — a Monday-morning 5-minute review catches the runaway sessions before they become monthly-bill surprises.
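A minimal version of that logging setup. The schema and column names here are made up for illustration; use whatever fields your surface actually reports:

```python
# Sketch of a per-message token log with a weekly GROUP BY review.
# Schema is illustrative, not a standard format.
import sqlite3

# Use a real file path (e.g. "tokens.db") to persist between sessions;
# ":memory:" keeps this sketch self-contained.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE IF NOT EXISTS usage (
    ts TEXT DEFAULT CURRENT_TIMESTAMP,
    project TEXT, model TEXT,
    input INT, output INT, cache_read INT, cache_create INT)""")

def log_message(project, model, input_t, output_t, cache_read=0, cache_create=0):
    """One row per message: the token counts reported for that call."""
    db.execute(
        "INSERT INTO usage (project, model, input, output, cache_read, cache_create) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (project, model, input_t, output_t, cache_read, cache_create))
    db.commit()

def weekly_review():
    """Totals per project and model, biggest burners first."""
    return db.execute(
        """SELECT project, model, SUM(input), SUM(output), SUM(cache_read)
           FROM usage GROUP BY project, model
           ORDER BY SUM(input + output) DESC""").fetchall()
```

The Monday-morning review is just `weekly_review()` plus five minutes of reading the top rows.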
11. Why 1M doesn't mean “use 1M”
When people hear “1 million token context window,” they treat it like a budget to spend. Three failure modes follow:
- Stop using sub-agents because there's “room” in the main session
- Stop compacting because the progress bar still looks mostly empty
- Offload everything into one giant session instead of chaining
The rules of how transformer models work didn't change when the window got bigger. A bigger window means more room for context rot, not a better output ceiling. The first 20% of a session is prime time — CLAUDE.md is fresh, the model's attention is focused, and retrieval accuracy is near its ceiling.
The 120K rule
On Opus with a 1M window, cap yourself at ~120K tokens (12%). That's the same 60% mark people successfully used back when 200K was the ceiling — a decent guardrail for staying in the model's sharp range. You can always chain a follow-up session.
The number isn't sacred. But having a concrete number you check against is worth more than “I'll compact when it feels right.”
If you're new to Claude Code, stay on the 200K context models for a few weeks before moving to Opus 1M. Learn the discipline — compaction habits, session chaining, CLAUDE.md hygiene — on the smaller window first. When you graduate, the larger window is insurance, not a target.
12. Tool ecosystem
A handful of open-source tools tackle token reduction from different angles. Don't install all of them — each has overhead and they don't all play well together. Pick one or two that match your actual workload.
Token-killer
CLI proxy that filters terminal output before it hits your context. Useful when tool output is your biggest token sink.
Context-mode
Sandboxes raw tool output into SQLite instead of dumping it into the session. Claude queries the SQLite when it needs specifics.
Claude-context
Plans the next N turns and pre-compacts opportunistically. Best for long coding sessions where you already know the task shape.
Token-optimizer MCP
Exposes an MCP tool Claude can call to get token counts + suggestions for what to trim. Good for visibility, less for automatic savings.
Claude-token-efficient
A one-file CLAUDE.md template that keeps responses terse. Real savings are modest (per section 3) but it pairs well with the other tools.
How to pick: the cheapest approach is to point Claude itself at the list, describe your workflow in 3-4 sentences, and ask for the best match given your actual pain points. Any one tool is a small win. Good habits across all 13 sections of this guide are a 50-70% win.
13. When to just start over
Sometimes a session is just off. The model is repeating itself, it's missing obvious things, tool calls are failing for no clear reason. You're below your usual threshold for compaction. Should you soldier through?
No. Clear it. Start fresh. The cost of “wasting” a half-full session is almost always lower than the cost of pushing through context rot. Your best guard against burning tokens on a bad session is the discipline to walk away from it.
This is the habit people underrate the most. Having a tracking file or a handoff summary saved somewhere means that “starting fresh” costs you two minutes, not thirty. Build the safety net and using it becomes reflex.
FAQ
Is /compact or /clear better?
Depends on the next task. /compact keeps a summary of the current work — use it when you're continuing the same task but want a leaner context. /clear drops everything — use it when you're switching tasks entirely. In practice many people skip /compact and use the manual-summary-into-clear pattern from section 7, which gives more control over summary quality.
Why does my session get slower the longer I use it?
Two reasons. First, every new message re-reads the entire conversation so input tokens per turn keep growing. Second, context rot — the model itself gets measurably worse at retrieval and reasoning as the window fills (92% accuracy at 256K drops to 78% at 1M). Longer sessions are literally slower and literally dumber.
When should I use a sub-agent vs. just doing it in my main session?
Delegate when the task is bounded and the intermediate steps don't need to stay in your main context. Good candidates: codebase exploration, documentation lookup, running tests and reporting only the failures, independent code review. Don't delegate when the output needs to blend with your ongoing work or when you need to steer the model mid-task.
Do output tokens really cost more?
Yes — roughly 4-5× per token across current Claude models. But output volume is usually a small fraction of total spend in a typical session (tool outputs are counted on the input side on the next turn). Telling Claude to be concise saves maybe 5-10% on real workloads. The big savings are in managing what enters context in the first place.
What's Anthropic's prompt cache and should I use it?
The prompt cache gives you a ~90% discount on cached input tokens when the same prefix appears on multiple requests. It's automatic for repeated message prefixes in Claude Code and opt-in on the API. If your CLAUDE.md and system prompts are stable across a session, you're already benefiting. If your cache hit rate is low, you're probably churning the early parts of your context unnecessarily.
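To see what hit rate is worth, here is the effective-price arithmetic under the ~90% discount, in illustrative units rather than real prices:

```python
# Effective input cost as a function of cache hit rate.
# Price units and the 90% discount are illustrative assumptions.

def effective_input_cost(tokens, hit_rate, price=1.0, discount=0.90):
    """Cost in units of the uncached price; cached tokens pay the discounted rate."""
    cached = tokens * hit_rate
    uncached = tokens - cached
    return uncached * price + cached * price * (1 - discount)

full = effective_input_cost(1_000_000, 0.0)
mostly_cached = effective_input_cost(1_000_000, 0.8)
print(full, mostly_cached)
```

At an 80% hit rate, a million input tokens cost like 280K uncached ones. That gap is why a stable CLAUDE.md and stable system prompts matter.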
How do I tell when I'm getting close to a usage limit?
The Claude desktop app surfaces the remaining session-limit bar live. In Claude Code, /context shows you token usage against the window. The 5-hour rolling window isn't always visible as a specific number — if in doubt, check the Claude.ai settings page or your Anthropic console usage tab.
Is 1M context worth the extra cost over 200K?
Only if you genuinely need to load that much context at once. For most coding work, a 200K window plus good session chaining beats a 1M window plus sloppy habits. Retrieval accuracy drops 14 points between 256K and 1M — you can hold more, but Claude finds the right thing less reliably.
Next
Put this to work.
The fastest way to see where your tokens are actually going is to grade the prompts and skills you're already using. CHANN3L's Eval scores any skill, prompt, or agent definition on 7 dimensions in 3 minutes — and flags the specific parts that will bloat context fastest.
Want the deeper Anthropic docs? Start with prompt caching, context windows, and the Claude Code reference.