I had a feeling my Claude Code usage had gotten weird.
Not “I asked too many questions” weird. More like: I had terminal sessions open all day, cron jobs running, a persistent remote-control service so I could poke Claude Code from my phone, and occasional orchestrator prompts that spun up agents, those agents spun up worktrees, and the worktrees sometimes ran their own PRD implementation loops.
That is not a chat history anymore. That is an operating system made of transcripts.
So I finally audited it.
Thirty days. 3.28 billion tokens across Claude Code. ~$2,400 at Anthropic’s published API list pricing — about 12x the cost of a $200 Max subscription. The number to anchor on isn’t the cache breakdown — those are local-log estimates with caveats. The number to anchor on is the order of magnitude and the shape: where did all this go, and what would I change next month?
The headline surprised me: the extra tools, skills, and MCP definitions were not the main bill. They were visible. They were annoying. They were worth cleaning up. But they were not where the real burn came from.
The real burn came from a simple pattern:
I was not paying for one smart agent. I was paying for many agents to carry many large cached prefixes through many turns.
— Brett Ridenour
What I installed
I started with two open-source local tools:
- ccusage, which reads local Claude Code usage from disk and reports daily, session, and billing-block-style usage.
- cc-lens, which gives a local browser dashboard over ~/.claude.
For Opus 4.7, the public Claude API pricing page lists $15/M input and $75/M output, with cache pricing also published per tier. ccusage reads my local JSONL transcripts and produces session-level dollar estimates from them. Those estimates are useful for shape — which orchestrator was big, which workers ran long, which sessions dominated — but I’m not going to publish bill-grade caching-specific breakdowns from them. If you want exact numbers, your real Anthropic invoice is the source of truth. The local logs are how I find out which session to argue about.
Then I added my own tiny command:
claude-session-audit <session-id-or-path>
That command walks a single Claude Code session, includes its nested subagents/ folder, sums token usage directly from the JSONL, and writes a markdown report with:
- parent-vs-subagent usage
- cache-read and cache-create totals
- largest cache events
- repeated billing shapes
- tool-call counts
- initial prompt samples
The important thing is that it audits a workload, not just a day.
ccusage daily answers: “How much did I use on Tuesday?”
claude-session-audit answers: “What did that giant orchestrator actually do?”
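The core of a session-level audit like this can be sketched in a few lines of Python. This is a minimal sketch, not the actual claude-session-audit source. It assumes each assistant record in the JSONL carries a `message.usage` object with API-style field names, and that a message may be logged more than once, so it counts each message id once. The function names `sum_usage` and `audit_session` are made up for illustration.

```python
import json
from collections import Counter
from pathlib import Path

# Assumed field names, mirroring the usage object returned by the Claude API.
USAGE_FIELDS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "output_tokens",
)

def sum_usage(jsonl_path):
    """Sum token usage across the records of one transcript file."""
    totals = Counter()
    seen_ids = set()
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or partial lines
            message = record.get("message") if isinstance(record, dict) else None
            if not isinstance(message, dict):
                continue
            usage = message.get("usage")
            if not isinstance(usage, dict):
                continue
            # Assumption: the same message id can appear on several lines.
            msg_id = message.get("id")
            if msg_id is not None:
                if msg_id in seen_ids:
                    continue
                seen_ids.add(msg_id)
            for field in USAGE_FIELDS:
                totals[field] += int(usage.get(field) or 0)
    return totals

def audit_session(session_file):
    """Return (parent_totals, {subagent_file_name: totals}) for one session."""
    session_file = Path(session_file)
    parent = sum_usage(session_file)
    # Subagent transcripts sit in <session-id>/subagents/ beside the parent file.
    subagent_dir = session_file.with_suffix("") / "subagents"
    workers = {p.name: sum_usage(p) for p in sorted(subagent_dir.glob("*.jsonl"))}
    return parent, workers
```

Keeping the parent and each subagent as separate totals is what makes the parent-vs-subagent comparison possible later.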
Where Claude Code stores the evidence
Claude Code writes local transcripts under:
~/.claude/projects/
The folder names are path-slugged. My Freebo repo, for example, shows up as:
~/.claude/projects/-home-brettr-Documents-FreeboDecember/
Main sessions are JSONL files:
3f564fba-c351-43ad-b9bd-a4ca6af99b26.jsonl
Subagents usually live beside that file in a folder with the same session id:
3f564fba-c351-43ad-b9bd-a4ca6af99b26/subagents/agent-a545f899c456836fc.jsonl
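Given that layout, finding every transcript that belongs to one session is a short directory walk. The sketch below assumes the layout described above holds (parent file named after the session id, subagents in a sibling folder). `find_session_files` is a hypothetical helper, not part of any tool mentioned here.

```python
from pathlib import Path

def find_session_files(session_id, root=None):
    """Return (parent_transcript, [subagent_transcripts]) for a session id."""
    # Default to the transcript root described above.
    root = Path(root) if root is not None else Path.home() / ".claude" / "projects"
    # Search every project folder rather than guessing the path slug.
    matches = sorted(root.glob(f"*/{session_id}.jsonl"))
    if not matches:
        raise FileNotFoundError(f"no transcript for session {session_id} under {root}")
    parent = matches[0]
    subagent_dir = parent.parent / session_id / "subagents"
    subagents = sorted(subagent_dir.glob("*.jsonl"))
    return parent, subagents
```

Calling `len(subagents) + 1` on the result gives the same kind of file count quoted for each run in this post.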
In my current local tree, the main Claude Code profile had:
- 1,740 JSONL files
- 641 subagent JSONLs
- many project roots, including home, FreeboDecember, Brett Omarchy, Reelforge, Astro blog, and old worktree paths
That explained why aggregate reports were useful but not sufficient. A “session” can hide a whole agent network.
The FreeboDecember orchestrator
The first run I wanted to audit was a huge FreeboDecember PRD push.
This was the kind of prompt that makes sense at midnight and looks insane in the morning: a parent orchestrator dispatching many feature implementers, each in isolated git worktrees, each responsible for turning a PRD into a branch and PR.
The session id was:
3f564fba-c351-43ad-b9bd-a4ca6af99b26
The audit found 28 JSONL files: the parent transcript plus 27 subagent transcripts.
The parent was the biggest single line item.
| Unit | Total tokens | Cache read | Cache create | Output | Est. cost |
|---|---|---|---|---|---|
| Parent orchestrator | 115.9M | 114.2M | 1.4M | 265.7K | $73 |
| Top worker: PRD-J | 23.4M | 23.0M | 295.9K | 55.7K | $15 |
| Top worker: PRD-I | 15.8M | 15.5M | 298.0K | 33.5K | $10 |
| Top worker: PRD-M | 15.2M | 14.9M | 249.9K | 33.3K | $10 |
The cache-read column is the story. Almost all of the tokens were cached prefix reads.
That is cheaper than raw input. It is still usage.
The 100-subagent run
Then I found the run I had in the back of my mind: the one with about 100 subagents.
It was under:
~/.claude/projects/-home-brettr/433b6ea4-c48e-4f27-9c0b-20acfacf74cd/
That session had 101 JSONL files: the parent plus exactly 100 subagents.
This one had the opposite shape.
FreeboDecember was mostly cache reads. The 100-subagent run had a huge cache-creation bill:
- 38.8M cache-create tokens
- 47.2M cache-read tokens
- 1.87M output tokens
- about $313 ccusage-style estimate
The parent alone accounted for about $263.
That is the moment the model of “number of subagents equals cost” gets too simple. The subagents matter, but the parent orchestration loop can dominate if it keeps creating and mutating a large prefix.
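Pricing each token bucket separately makes the two shapes easy to compare. The sketch below is an approximation under loud assumptions: the per-million rates are hypothetical placeholders, chosen only because they land near the estimates quoted in this post. They are not taken from the pricing page or from ccusage, so substitute current published rates before trusting any dollar figure.

```python
# Hypothetical USD rates per million tokens. Replace with current published pricing.
ASSUMED_RATES = {"cache_create": 6.25, "cache_read": 0.50, "output": 25.00}

def estimate_cost(tokens, rates=ASSUMED_RATES):
    """Return (cost per bucket, total cost) for raw token counts."""
    by_bucket = {k: tokens.get(k, 0) / 1_000_000 * rates[k] for k in rates}
    return by_bucket, sum(by_bucket.values())

# Token counts as reported in this post.
freebo_parent = {"cache_create": 1_400_000, "cache_read": 114_200_000, "output": 265_700}
hundred_run = {"cache_create": 38_800_000, "cache_read": 47_200_000, "output": 1_870_000}

for name, tokens in [("FreeboDecember parent", freebo_parent),
                     ("100-subagent run", hundred_run)]:
    by_bucket, total = estimate_cost(tokens)
    top = max(by_bucket, key=by_bucket.get)
    print(f"{name}: ~${total:,.0f}, dominated by {top} ({by_bucket[top] / total:.0%})")
```

With these placeholder rates, the first run comes out dominated by cache reads and the second by cache creation, which matches the two shapes described above.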
The thing I thought was expensive
I expected the bloated tool list to be the smoking gun.
Each FreeboDecember worker was starting with a huge capability surface:
- about 315 tool names per subagent
- 144 skills listed
- browser tools
- Slack tools
- Notion tools
- Wix tools
- Vercel tools
- n8n tools
- Todoist tools
- NotebookLM tools
- Railway tools
- Supabase tools
- the actual tools needed for code
That looks bad because it is bad.
But the direct cost was smaller than I expected.
I measured the first assistant turn for each FreeboDecember subagent. That is the closest observable proxy for “booting the agent with tools, skills, MCPs, and initial context.”
Across all 27 subagents, first-turn startup was about:
- 711,618 cache-create tokens
- 157,824 cache-read tokens
- roughly $4.50 ccusage-style estimate
So I changed my mental model.
The bad part of a giant tool surface is not only “the prompt is bigger.” It is that it makes the agent’s world bigger.
A worker implementing a database migration does not need to know about Canva, Slack, Spotify, Wix, Todoist, Airbnb, Vercel toolbar threads, or browser screenshot controls. Even if those definitions cache well, they are still part of the attention surface. They increase the chance the agent explores, retries, or routes work through a tool that should not exist in that context.
The thing that actually got expensive
The FreeboDecember run was expensive because it combined four multipliers:
The bill was not one catastrophic prompt. It was multiplication.
The top worker alone had 193 assistant turns.
Another had 156.
Another had 148.
That is where “cached” stops feeling free. If a worker reads a 100K-token prefix 150 times, the fact that each read is cheap is not enough to make the run small.
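The arithmetic is worth writing down. This is a simplified sketch: it assumes every turn re-reads the whole prefix built so far, and the optional `growth_per_turn` parameter is a hypothetical stand-in for the transcript getting longer as the worker runs.

```python
def cached_prefix_reads(prefix_tokens, turns, growth_per_turn=0):
    """Total cache-read tokens when each turn re-reads the prefix built so far."""
    return sum(prefix_tokens + growth_per_turn * i for i in range(turns))

# The example from the text: a 100K-token prefix read 150 times.
print(f"{cached_prefix_reads(100_000, 150):,} cache-read tokens")
# prints "15,000,000 cache-read tokens"
```

Any growth per turn only pushes the total higher, because later turns read a longer prefix.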
Prompt caching made the run possible. It did not make the run disciplined.
— Brett Ridenour
The idle session question
I also wanted to understand what happens when I leave Claude Code open.
The short version:
An idle Claude Code session should not burn tokens by itself.
Tokens are consumed when something actually calls the model:
- user prompts
- agents
- cron jobs
- remote-control actions
- loops
- scheduled runs
- monitoring prompts
- tool-driven retries
Leaving the terminal open is not the same thing as running the model.
But cache is time-sensitive. If I leave a session idle for a day, the next message is probably not benefiting from the short-lived prompt cache. It may recreate a large prefix: tools, instructions, project context, prior transcript, loaded docs.
That is why a day-old session can feel “free” while sitting there and then expensive on the next real turn.
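A rough way to reason about this is to compare the idle gap against the cache lifetime. Anthropic's prompt caching documentation describes a default lifetime of five minutes, refreshed on use, with a longer one-hour option. Which lifetime a given Claude Code session actually uses is an assumption here, so the TTL below is a placeholder and the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed cache lifetime; the real value depends on how the client configures caching.
ASSUMED_TTL = timedelta(minutes=5)

def prefix_probably_expired(last_turn_at, now, ttl=ASSUMED_TTL):
    """True if the gap since the last model call exceeds the assumed cache lifetime."""
    return (now - last_turn_at) > ttl

last_turn = datetime(2025, 1, 1, 9, 0)  # illustrative timestamp
print(prefix_probably_expired(last_turn, datetime(2025, 1, 1, 9, 3)))  # False
print(prefix_probably_expired(last_turn, datetime(2025, 1, 2, 9, 0)))  # True
```

When the check returns True, the next turn should be budgeted as cache creation for the whole prefix rather than as a cheap cache read.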
What I would change in the orchestrator
If I rewrote my feature-orchestrator skill based on this audit, I would not start by deleting tools. I would start by changing the shape of the work.
1. Planning becomes mandatory
Every PRD should go through a cheap planning gate before implementation.
The output should be a compact execution brief:
- files likely touched
- DB/API/UI scope
- tests required
- dependency notes
- conflict risks
- “do not read” areas
- what would make the worker stop
Only the brief goes to implementation workers. Not the whole sprint context.
2. Fan-out gets capped
No more giant wave unless the work is truly independent.
For serious code:
- 3 to 4 workers per wave
- each wave must finish or block
- consolidator reads PR summaries and diffs
- next wave launches only after contracts are stable
For content or research:
- large fan-out is fine only if each worker gets a tiny prompt
- parent should not keep regenerating a giant state object
- workers should return strict JSON or compact bullets
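The wave cap for code work can be expressed as a small scheduling helper. This is a minimal sketch under assumptions: `plan_waves` and `run_waves` are hypothetical names, and `run_wave` stands in for whatever actually launches workers and reports whether the wave finished with stable contracts.

```python
def plan_waves(briefs, max_workers=4):
    """Split briefs into consecutive waves of at most max_workers each."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return [briefs[i:i + max_workers] for i in range(0, len(briefs), max_workers)]

def run_waves(waves, run_wave):
    """Run waves in order and stop at the first wave that does not finish cleanly.

    run_wave is a hypothetical callback that returns True when the wave
    finished and its contracts are stable.
    """
    completed = []
    for wave in waves:
        if not run_wave(wave):
            break  # do not launch the next wave on top of an unstable one
        completed.append(wave)
    return completed

waves = plan_waves([f"PRD-{c}" for c in "ABCDEFGHIJ"])
print([len(w) for w in waves])  # prints [4, 4, 2]
```

The point of the gate is that a blocked wave costs one wave of tokens, not the whole sprint.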
3. Workers get roles, not the universe
The worker prompt should say what kind of worker it is.
Each role gets only the tools and instructions it needs.
The DB worker does not need browser tools.
The docs worker does not need Supabase admin tools.
The API worker does not need Todoist.
The content worker does not need Railway.
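One way to enforce that is an allowlist per role, applied before the worker is spawned. Everything in this sketch is illustrative: the role keys, the tool-name prefixes, and the `tools_for_role` helper are hypothetical and do not correspond to real Claude Code or MCP identifiers.

```python
# Hypothetical roles and tool-name prefixes, for illustration only.
ROLE_ALLOWLIST = {
    "db": ("fs.", "git.", "supabase."),
    "api": ("fs.", "git.", "http."),
    "docs": ("fs.", "git."),
    "content": ("fs.", "notion."),
}

def tools_for_role(role, available_tools):
    """Keep only the tools whose names start with a prefix allowed for this role."""
    prefixes = ROLE_ALLOWLIST[role]
    return [tool for tool in available_tools if tool.startswith(prefixes)]

available = ["fs.read", "git.commit", "supabase.query", "browser.screenshot", "todoist.add"]
print(tools_for_role("db", available))
# prints ['fs.read', 'git.commit', 'supabase.query']
```

An unknown role raises a `KeyError`, which is deliberate in this sketch: a worker with no declared role should not quietly inherit every tool.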
4. Model routing becomes explicit
I had too much Opus doing mechanical work.
The split I want:
- Opus for orchestration, architectural review, final judgment
- Sonnet for normal implementation
- Haiku for summarization, docs, status compression, mechanical extraction
That alone would probably matter more than shaving a few thousand tokens off the startup context.
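Making the routing explicit can be as simple as a lookup table. In this sketch the job-type labels and the fallback tier are assumptions. Only the three-way split itself comes from the list above.

```python
# Job-type labels are hypothetical; the tiers follow the split described above.
MODEL_FOR_JOB = {
    "orchestration": "opus",
    "architecture_review": "opus",
    "final_judgment": "opus",
    "implementation": "sonnet",
    "summarization": "haiku",
    "docs": "haiku",
    "status_compression": "haiku",
    "extraction": "haiku",
}

def route_model(job_type, default="sonnet"):
    """Pick a model tier for a job type; unknown jobs fall back to an assumed default."""
    return MODEL_FOR_JOB.get(job_type, default)

print(route_model("summarization"))  # prints haiku
```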
5. Stop conditions should detect repeated assumptions
The earlier failure mode in this sprint was not “an agent wrote bad code.” It was multiple PRDs making incompatible assumptions about a feature flag and a shared contract.
The orchestrator should stop if:
- two workers fail on the same upstream assumption
- two workers touch the same migration or enum
- a feature flag is required by downstream PRDs but not enabled
- a shared type changes after workers have already started
- CI failures repeat across a wave
That is not a technical failure. It is a planning failure. The orchestrator needs to recognize it.
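Two of those conditions are easy to check mechanically once workers report in a structured way. This is a sketch under assumptions: the `failed_assumption` and `shared_items` fields are hypothetical and not part of any existing report format, and the other conditions would need data from CI and the feature-flag system.

```python
from collections import Counter

def stop_reasons(reports):
    """Return reasons to halt a wave, given one dict per worker (hypothetical fields)."""
    reasons = []
    # Two or more workers failing on the same upstream assumption.
    failed_on = Counter(
        r["failed_assumption"] for r in reports if r.get("failed_assumption")
    )
    reasons += [
        f"repeated assumption failure: {name}"
        for name, count in failed_on.items() if count >= 2
    ]
    # Two or more workers touching the same migration, enum, or shared type.
    touched = Counter(
        item for r in reports for item in set(r.get("shared_items", []))
    )
    reasons += [
        f"overlapping change: {item}"
        for item, count in touched.items() if count >= 2
    ]
    return reasons
```

An empty list means the wave can proceed. Anything else is a signal to stop and re-plan rather than retry.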
6. Status output gets compressed
Long status reports are comforting and expensive.
Workers should return a compact schema:
{
  "status": "done | blocked | failed",
  "branch": "feat/example",
  "pr": "url-or-null",
  "files_changed": 12,
  "tests": ["typecheck", "lint"],
  "blockers": [],
  "contracts_changed": ["locations.vertical"]
}
The parent can store that. It does not need prose unless something is actually blocked.
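The parent can also reject anything that does not fit the schema, which keeps prose from leaking back into its context. A minimal validation sketch follows. The rule that a blocked report must list at least one blocker is an added assumption, and `parse_status` is a hypothetical helper.

```python
import json

ALLOWED_STATUS = {"done", "blocked", "failed"}
REQUIRED_KEYS = {
    "status", "branch", "pr", "files_changed",
    "tests", "blockers", "contracts_changed",
}

def parse_status(raw):
    """Parse a worker's status report and reject anything off-schema."""
    report = json.loads(raw)
    if not isinstance(report, dict):
        raise ValueError("status report must be a JSON object")
    missing = REQUIRED_KEYS - report.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if report["status"] not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {report['status']!r}")
    # Assumption: a blocked worker has to say what blocked it.
    if report["status"] == "blocked" and not report["blockers"]:
        raise ValueError("blocked report must list at least one blocker")
    return report
```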
7. Usage telemetry becomes part of the loop
After each wave, the orchestrator should run a usage audit.
Not after the sprint. Not after the bill feels weird. After the wave.
The report should ask:
- Which agent spent the most?
- Was the parent more expensive than the workers?
- Did cache creation spike?
- Did repeated billing signatures show polling or retries?
- Did the tool list include irrelevant MCPs?
- Should the next wave continue?
That last question matters. Sometimes the correct next action is not “spawn more agents.” It is “summarize, compact, and restart from a narrower plan.”
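Several of those questions reduce to comparisons on numbers the audit already produces. The sketch below is one possible gate, not a definitive policy: the 0.5 cache-creation threshold is an arbitrary placeholder, "parent more expensive than the workers" is interpreted as the parent costing more than all workers combined, and the worker names in the example are invented.

```python
def review_wave(parent_cost, worker_costs, cache_create_share, spike_threshold=0.5):
    """Summarize a wave's spend and decide whether the next wave should launch.

    cache_create_share is the fraction of estimated cost that came from
    cache creation. spike_threshold is a hypothetical cutoff.
    """
    workers_total = sum(worker_costs.values())
    top_worker = max(worker_costs, key=worker_costs.get) if worker_costs else None
    parent_dominates = parent_cost > workers_total
    cache_spike = cache_create_share > spike_threshold
    return {
        "top_worker": top_worker,
        "parent_dominates": parent_dominates,
        "cache_create_spike": cache_spike,
        "continue": not (parent_dominates or cache_spike),
    }

# Illustrative inputs shaped like a parent-heavy run.
print(review_wave(263.0, {"worker-1": 30.0, "worker-2": 20.0}, cache_create_share=0.78))
```

A result with `"continue": False` is the cue to summarize, compact, and restart from a narrower plan instead of spawning more agents.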
The revised orchestrator contract
If I rewrote the skill tomorrow, the heart of it would look like this:
You are a wave orchestrator, not a giant autonomous engineer.
Phase 1: turn PRDs into compact execution briefs. No code.
Phase 2: select at most 4 independent briefs.
Phase 3: spawn narrow workers with role-specific tools.
Phase 4: require compact structured status.
Phase 5: consolidate PRs, contracts, and failures.
Phase 6: run token audit before the next wave.
Stop if repeated failures point to one shared assumption.
Stop if a shared contract changes mid-wave.
Stop if the parent session becomes the largest cost center.
Stop if the next wave would launch with unresolved schema or feature-flag state.
That is less glamorous than “run 27 engineers all night.”
It is also closer to how good engineering work actually scales.
What I learned
The important lesson is not “agents are expensive.”
The important lesson is that agent architecture has a cost model.
The startup context was visible and easy to blame. It was not the main cost.
The real problem was that I let an orchestrator create a lot of long-running workers, each with enough context and freedom to behave like a full engineer, and then I let them loop until the PRDs were either implemented, blocked, or exhausted.
That can be useful. Sometimes it is exactly what I want.
But if I am going to do it regularly, I need to treat it like infrastructure:
- measure every big run
- keep workers narrow
- cap fan-out
- route models by job type
- compress status
- stop on repeated assumptions
- audit after each wave
Claude Code did not become expensive because it was sitting open in a terminal.
It became expensive when I gave it a whole engineering organization and forgot to give that organization a budget.