"Hit my Claude limit" is the new flex
...and what I do now to keep my AI working. Everything you need to know and a framework.
AI tokens got roughly 1,000x cheaper in three years. My AI bill went from $0 to $200 — and $200 wasn’t enough.
Last week I was upgrading our startup’s UI while one agent scanned industry trends in the background and Claude chewed through transcripts from a user interview and an adviser meeting we’d just had. Normal Tuesday.
Then the screen told me I’d hit the limit on Claude’s Max plan.
So… I read Dostoevsky. Spent extra time with my 1-year-old. Baked a quiche. It turned out to be a very good break, but…
I thought I was doing something wrong. Then my friends started showing off. “Hit my limit by 2pm today.” It’s a flex now.
Uber burned through its annual AI budget a few months into 2026. Engineering Max plans run $100-200 per person — one CTO was seeing $600 bills. My number wasn’t high. It was the floor.
You don’t plan to overspend on AI. Or on a Sicily vacation. The bill comes anyway.
Where 99.93% of your tokens go
So, my first instinct was to blame myself — maybe my mempalace plugin config was bloated, maybe my system instructions were too long. I trimmed both. Still hit the limit.
I asked Claude to audit its own token usage. The verdict: it re-read and re-wrote the same content, again and again.
A developer who audited 30 days of Claude Code put a number on it: 1,310 re-reads per 1 productive token. 99.93% of what I was paying for was the machine reading what it already knew.
I kept pulling the thread. One wall became a stack.
The routing tilts expensive: Claude Code sends 93.8% of tokens through Opus, the priciest model in the lineup.
A friend sent me a Hacker News screenshot. Boris Cherny, Claude Code creator, was explaining why bills spike: million-token contexts get expensive fast, and background agents trigger surprise charges for “a surprisingly large number of users.” The thread had 750+ points and 650+ comments.
Agents are voracious — agentic workflows use 5-30x more tokens than a chatbot, multi-agent setups about 15x more than a single chat. They don’t know which files they need, so they read everything — 60-80% of tokens wasted just locating code. Every MCP plugin ships its tool schemas with every call; one team found 72% of their context window eaten by tool definitions before the agent saw a single message. I was running around 30 agents daily, with mempalace, GitHub, and Slack plugins. The math was not in my favor.
Thinking tokens bill at output rates, 5x input. Self-attention scales quadratically: double the context, quadruple the compute. Each layer multiplies through the others faster than I could trim.
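The “double the context, quadruple the compute” claim is just arithmetic. A toy sketch in relative units — not real FLOP counts, just the n² shape of self-attention:

```python
# Toy model of why long contexts get expensive: self-attention compares
# every token with every other token, so compute grows with n^2.
def attention_cost(n_tokens: int) -> int:
    """Relative compute for self-attention over a context of n tokens."""
    return n_tokens * n_tokens

base = attention_cost(100_000)
doubled = attention_cost(200_000)
print(doubled / base)  # -> 4.0: double the context, quadruple the compute
```

Real models layer optimizations on top of this, but the quadratic floor is why a million-token context costs so much more than two half-million ones.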
Economists call this the Jevons paradox. a16z calls it “LLMflation”. Cheaper units don’t reduce consumption. They unlock it.
At the same time, Anthropic pays even more.
One user calculated their virtual cost at $1,274 a month — on a $200 plan. Anthropic eats the difference today; that subsidy won’t last.
DeSight Studio’s analysis puts the platform-side number at roughly $5,000 of inference compute burned per $200 Max subscriber. That’s not a small gap a price hike closes; it’s 25x. Every heavy user is being underwritten by venture capital and the implicit promise that someone, eventually, pays full price.
Designing around the bill
When my token bill kept climbing, I eventually stopped trying to push it down and started designing around it.
But first, I tried the obvious fixes.
Killed half my skills and plugins — the ones I’d installed once and forgotten. Cleaner config; it helped a bit. But the overhead wasn’t in the tools I’d stopped using; it was in the ones I used every day.
Compacted earlier, cleared more often. In the middle of finalizing a pitch deck, I ran /compact too soon — the thread I’d been chasing got summarized into a sentence, and the investigation was gone. I spent the next hour rebuilding what I’d just discarded.
Routed the cheap work to Haiku. Extracting action items from a user research transcript seemed like pattern-matching. It wasn’t. Haiku returned generic bullets that missed the three decisions the customer had actually announced. I reran on Opus, ate two passes instead of one, plus the time I’d lost trusting the first output.
Switched to Cursor one day when Claude locked me out again — same folders, same files, two hours. It was fun, and I realized that both I and my work are truly model-agnostic. But I still prefer Claude.
The token-by-token war was unwinnable. Every fix bought back a few percent that the next agent run spent twice over. I needed a discipline.
Every cheap technology creates an expensive discipline
Cloud computing got FinOps. AI hasn’t named its version yet.
A year ago, “AI cost management” wasn’t a PM skill. Axial Search analyzed 592 AI PM job postings and flagged “inference cost economics” — what does it cost to serve this feature at scale? — as a skill companies need but rarely require. Jensen Huang says a $500K engineer should spend $250K on tokens.
Four things I do now.
1. Outcomes, not tokens.
Tokens went up last year. Value didn’t. The honest answer to “what does this pipeline cost per run?” is: I don’t know yet. The instruments shipped weeks ago. I set up a custom /statusline to show the model name and context percentage with a progress bar: describe whatever you want in your own words and Claude sets it up for you. And if you pay per token, you can always type /cost.
My status line:
The status line is the first thing I check at my daily check-in. It showed 10K tokens of overhead per session, most of it the model guessing at my mempalace table schemas before every query. That’s when I locked the schemas: halved the routine, got back 5K tokens per run. A small number, but I run it every day. A small number times every day is free tokens.
You stop asking what’s eating your context. You look down. And once you can see what a task will cost, you start making decisions you couldn’t before — heavy runs go to nighttime, batches replace one-offs, loops handle the rest. Some pipelines I trim. Some I kill.
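If you want the same readout outside Claude, the rendering half of a status line is plain string formatting. A minimal sketch — how the model name and context percentage reach the script depends on your tool’s statusline hook (Claude Code pipes session JSON to a configured command; the exact schema is version-dependent), so I’m leaving that wiring out:

```python
def render_status(model: str, context_pct: float, width: int = 20) -> str:
    """One line of status: model name plus a context-usage progress bar."""
    filled = round(width * context_pct / 100)  # bar cells to fill
    bar = "#" * filled + "-" * (width - filled)
    return f"{model} [{bar}] {context_pct:.0f}%"

# Feed it whatever your statusline hook reports:
print(render_status("Opus", 42.0))  # -> Opus [########------------] 42%
```

The point isn’t the bar; it’s that the number sits in your peripheral vision instead of behind a /cost call.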
2. Route by capability.
You’ll find most of your tasks can be done with cheaper models — but you have to test, or you can end up spending even more. Prices differ 5x between models on input, and 5x again on output. The advisor strategy: the expensive model plans, the cheap model executes. Thinking tokens bill as output, so /effort low strips extended thinking on routine work.
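The plan/execute split can be sketched as a trivial router. The model names and per-million-token prices below are placeholders I made up to mirror the 5x spreads in the text, not current pricing:

```python
# Hypothetical price table (USD per million tokens). Placeholder numbers
# with a 5x spread between tiers -- not any vendor's real rates.
PRICES = {
    "big-model":   {"input": 15.0, "output": 75.0},
    "small-model": {"input": 3.0,  "output": 15.0},
}

def route(task: str) -> str:
    """Expensive model plans; cheap model executes routine work."""
    return "big-model" if task == "plan" else "small-model"

def cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one call at the placeholder rates above."""
    p = PRICES[model]
    return (tokens_in * p["input"] + tokens_out * p["output"]) / 1_000_000

# One planning pass on the big model, ten execution passes on the small one:
plan = cost(route("plan"), 20_000, 2_000)        # $0.45
execute = 10 * cost(route("execute"), 20_000, 2_000)  # $0.90
print(f"plan ${plan:.2f} + execute ${execute:.2f}")
```

At these made-up rates, ten cheap execution passes still cost only twice one expensive planning pass — which is why the test-before-you-route step matters: one bad cheap-model output that forces an expensive rerun erases the savings.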
3. Three knobs that move the bill.
I was running mempalace, GitHub, and Slack plugins in parallel. Each shipped its full tool schema with every message I sent. 72% of my context was spent before Claude read the first character of my prompt.
The 8-multiplier list looks intimidating. In practice, three knobs do most of the work: which model handles the task, how many plugins ship their schemas with every call, and how long you let the context grow.
You’ll hear engineers cite a “10% context” rule of thumb for when Claude starts degrading. For Opus 4.6 the real threshold is closer to 25-30% of its 1M window. Chroma tested 18 frontier models. Every one degrades with input length.
4. Stay portable.
I used to call portability a cost hedge. April 2026 rewired that. Nine incident days in sixteen: April 3, 6, 7, 8, 10, 13, 14, 15, 16. Anthropic’s status page sits at 98.79% over the trailing 90 days — “one nine,” as a Hacker News commenter put it under the April 13 outage thread. Fortune ran a named backlash piece on April 14 with AMD’s senior director of AI on record: “Claude has regressed to the point that it cannot be trusted to perform complex engineering.” A GitHub issue documents a month-long usage-drain bug — some Max 20× sessions depleted in 19 minutes. No RCA. No credits.
(at least now we know it was all for Opus 4.7’s sake.)
While I was researching for this article, I dispatched an Opus 4.6 subagent to research those outages. After ten minutes and thirty-four seconds, it died: API Error: 529 — Overloaded. Status page at that minute: “Opus 4.6 elevated rate of errors — identified, fix in progress.”
What I do each session:
Pick the model for the task.
Activate /caveman — strips filler from Claude’s output, ~65% fewer output tokens.
Glance at the statusline.
When context passes 35%, ask Claude for a handover note.
Then /compact or /clear and start fresh.
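The 35% handover rule is mechanical enough to script. A sketch — the threshold is my habit, not a documented limit:

```python
def next_action(context_pct: float, handover_at: float = 35.0) -> str:
    """Per the session checklist: keep going, or write a handover note.

    handover_at is a personal threshold (35% here), not a documented limit.
    """
    if context_pct < handover_at:
        return "continue"
    # Past the threshold: capture state before the context gets compacted away.
    return "handover note, then /compact or /clear"

print(next_action(12.0))  # -> continue
print(next_action(41.0))  # -> handover note, then /compact or /clear
```

The handover note is the part that matters: /compact alone is what lost me the pitch-deck investigation earlier.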
This morning: opening Claude and running /caveman took 3% of my context. Adding my daily personal assistant checkin — another 2%. Five percent, 5K tokens, before real work began.
I wrote “It Was Never About the Model” a week ago. Turns out that applies to how you pay for it too.
I don’t mind getting kicked out from time to time
Now when Claude locks me out, I close the laptop.
Either I meant to spend those tokens, or I switch to Cursor. Both are fine.
The rare unproductive hour is the part I almost missed. Read a book. Sit with my kid. Whatever — at this pace, those hours are precious. Grab yours when they land.
Thank you to my founding members — Kostas Nasis, Artem Krivonos, and Kristina Hananeina. Your early support made this newsletter real.