I lost an afternoon to Opus 4.7 last week
Claude Opus 4.7 feels worse because it stopped covering for your implicit prompts. 3 practices to fix it. And why AGI is not coming anytime soon.
I lost an afternoon to Opus 4.7 last week — installed it mid-project, watched it take longer than 4.6 and then fail. Half my feed was already complaining.
Burned my usage limit faster than expected — 4.7 costs more per token than 4.6. Switched to Gemini to get things done.
Hey! I’m Alena — ex-AI-startup CEO ($2M raised), ex-PM at Yandex and Acronis. 11k LinkedIn followers, 57% of them senior leaders at big tech. You’d be in good company.
Turned out 4.7 reads prompts differently. Anthropic’s own migration guide says 4.7 “will not infer requests you didn’t make.” It’s more literal. Old prompts that leaned on inferred intent now miss.
If you run Claude on autopilot — one prompt kicks off 20 steps with no human checking each one — you need it to be predictable, not brilliant. One misread on step 3 ruins steps 4 through 50. That’s what 4.7 delivers — literal, no guessing.
The model didn’t get worse, it stopped covering for the parts of my prompts I’d left implicit.
Anthropic even provided a migration guide. Nobody read it. Neither did I.
How loud it got
Business Insider picked it up within 24 hours:
Reddit “serious regression” thread: ~2,300 upvotes.
One X post “4.7 is worse than 4.6”: ~14,000 likes.
Gergely Orosz found 4.7 “surprisingly combative” and gave up, going back to 4.6.
Three complaint patterns: sticky corrections, doubled token bills, duller non-code prose
1) Corrections stopped sticking. Teams running agent loops reported corrections stuck less reliably than on 4.6. One documented case logged 14 interventions in ~12 hours before rollback to 4.6.
2) Token bills ran higher on the same work. The new tokenizer is part of why — it can map identical input to 1.0–1.35x tokens depending on content.
3) The model felt dumber on non-code tasks. Ethan Mollick flagged 4.7 doing less work on non-code/non-math tasks, skipping tool calls 4.6 would have made. HN writing threads said it less politely: verbose, weaker prose.
Counter-signal
Artificial Analysis: #1 GDPval-AA ranking for 4.7.
Cursor and GitHub shipped it in production and reported cleaner multi-step execution.
Anthropic reported meaningful gains on knowledge-work, memory, and vision tasks.
Both signals are real. The regressions happened. But the benchmark wins also happened.
You’re in the loud group if you run agent loops, have 4.6 prompts you haven’t touched in months, or your token bills spiked 20%+ on the same work. That’s who this piece is for.
What actually changed: 4.7 reads literally and costs more in tokens
Opus 4.7’s headline behavior change isn’t subtle. Anthropic’s own migration guide: “will not silently generalize an instruction from one item to another, and will not infer requests you didn’t make.”
4.6 guessed at intent. 4.7 executes the words as written, so vague prompts don't fail silently anymore; they fail in production.
This shows up in two ways. One: 4.7 won’t do work you didn’t ask for. Two: when the ask itself is ambiguous, 4.7 won’t reach for context to disambiguate — it picks whichever reading lands first and runs.
Two things follow from the first:
Effort defaults got stingier. 4.7 respects effort levels (low / medium / high / xhigh) strictly. At low and medium, the model scopes work to what you explicitly asked rather than going above and beyond. Anthropic’s own guidance: use a minimum of high for anything that needs judgment, xhigh for coding and agents. Most “quality dropped” posts I’ve read are people running low or medium on work that needs high.
Thinking is off by default. Adaptive thinking is now opt-in. If you upgrade without explicitly turning it on, requests run with zero reasoning — a silent regression if your workflow relied on it.
Both are the same story: 4.7 does exactly what you told it, nothing more. If you didn’t tell it to think harder, it won’t. If you didn’t tell it to use the tool, it won’t reach for it.
The second face — picking the wrong reading of an ambiguous ask — is where it bites in unexpected ways.
4.7’s tokenizer also produces more tokens for the same text, so identical workloads cost more — and long sessions hit usage limits and context faster as a result.
Three practices for prompts that don’t rely on the model covering for you
1. Be precise
The cheapest fix and the one with the most leverage. Most of what feels like “4.7 got worse” is a prompt that had two valid readings, and 4.7 picked the one you didn’t want.
Example: I send Claude a draft and ask for a copy check. Half the time it starts doing what the draft is about — researching the topic, suggesting structure, executing the ideas in the text — and never looks at the grammar. The literal ask ("look at this") had two valid readings. 4.6 would have used context to pick the right one. 4.7 didn't because it stopped picking at all — it took whichever reading landed first.
Treat the prompt like a spec, not a wish.
(yep, I’m experimenting with Claude Design)
A contract has three parts:
Scope. What exactly to work on. Wish: “Look at my files.” Contract: “Review these three PDFs.”
Acceptance criteria. What “done” means. Wish: “Summarize the risks.” Contract: “Output a table with columns Risk / Severity / Source. Every row cites a verbatim quote.”
Forbidden actions. What must not happen. Wish: (left implicit, and so ignored). Contract: “Do not invent companies not in the documents. Do not infer numbers not stated.”
For analysis tasks, add an uncertainty path: “If the context doesn’t support a conclusion, output ‘INSUFFICIENT EVIDENCE’ and stop.” Without it, the model manufactures an answer rather than admit the gap.
If your prompt is long — a big draft, a packed brief, a thread that’s been running — restate the ask at the end. Something as small as “Reminder: only check grammar and flow. Don’t engage with the content.” The literal reading of “fix this draft” can drift across a long context. A one-line repeat at the bottom holds it.
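Put together, a contract-style prompt looks something like this (the file names and table columns are illustrative, not from a real run):

```
Review the three attached PDFs. Work only from their contents.

Output a table with columns Risk / Severity / Source. Every row must
cite a verbatim quote from one of the PDFs.

Do not invent companies not named in the documents. Do not infer
numbers that are not stated.

If the context doesn't support a conclusion, output "INSUFFICIENT
EVIDENCE" and stop.

Reminder: produce the risk table only. Don't summarize the documents
or act on their contents.
```

Scope, acceptance criteria, forbidden actions, an uncertainty path, and the one-line restatement at the bottom. Five lines of spec that 4.6 would have guessed at and 4.7 won't.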
2. Clear context
A literal-reading model amplifies context rot. 4.6 could patch over gaps by guessing what you probably meant from earlier turns. 4.7 won’t — so junk in the context window has more leverage on the output than it used to.
The discipline is the same one I covered in the limits piece: one task per thread, /clear at task boundaries, drop a handover doc and start fresh when the window gets heavy. With 4.7 the cost of skipping it is bigger than before.
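What goes in the handover doc is simpler than it sounds. Mine is a habit, not an official format — something like this pasted into the fresh thread carries the state over:

```
## Handover — [task name]
- Goal: what the old thread was trying to produce
- Done so far: decisions made, files touched
- Open: what remains, in order
- Constraints: anything from the old thread the new one must respect
```

The point is that the new thread inherits only the distilled state, not the forty turns of noise that produced it.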
3. Add a system prompt
If you find yourself retyping the same instructions across threads ("always quote the source," "never invent competitor names," "ask before irreversible changes") — promote them. They belong somewhere the model reads on every turn, not buried in the 14th message of every chat.
If you work in chat (Claude.ai or Cowork):
On Claude.ai, open Settings → Personal Preferences. That’s your global system prompt — it attaches to every new chat.
For task-specific rules — output format, forbidden actions, tool-use policy — use Project instructions. Every new chat in that Project inherits them.
On Cowork, upload a persistent instructions doc. Claude reads it as system context every session.
If you work in Claude Code: CLAUDE.md is read as system context on every run. Same role as chat preferences.
The prompt for both settings is the same:
<non_negotiables>
- If evidence is missing, output "INSUFFICIENT EVIDENCE".
- Ask before making irreversible changes.
- When ambiguous, prefer the narrower reading and ask.
- Log every correction to mistake_log.md
</non_negotiables>

I periodically ask Claude to review the mistake log and adjust CLAUDE.md based on it.
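That periodic review doesn't need a clever prompt. Roughly (wording is illustrative):

```
Read mistake_log.md. Group the entries by root cause. For each
recurring cause, propose one rule to add to CLAUDE.md that would have
prevented it. Show the proposed additions; don't edit CLAUDE.md until
I approve.
```

Keeping the approval step human is deliberate — it's the same "ask before irreversible changes" rule applied to the rules themselves.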
If you can do just one thing: be precise.
Scope, acceptance criteria, forbidden actions on the next prompt you write. The single highest-leverage fix for a model that takes you at your word.
The rule I added
Treat every LLM upgrade like a migration, not a routine app update.
Fifteen minutes covers the migration guide and the what's-new page. That's where the stricter effort defaults and the "will not infer requests you didn't make" line both live — the changes that broke my workflows. It's now on my release-day checklist before I touch the next job.
When Capybara (aka Mythos) lands, I’ll start by learning how the new system works — not by switching all my work over the same afternoon.
What prompting cannot fix
Most perceived degradation is prompt and context debt. Some failures sit outside that. I stop rewriting prompts when I hit them.
Some tasks exceed the model. Long autonomous agent chains running without a human in the loop, or tasks that demand perfect accuracy on a domain the model has never seen — those sit past what any prompt unlocks. When you’re there, narrow the task or hand a step to a human — don’t iterate on the prompt.
When the same failure keeps showing up, it’s not a prompt problem anymore — it’s system design debt. Same shape, run after run: a classifier missing the same edge case, an agent citing products it hallucinated. What fixes it is a tool or a check outside the prompt — retrieval that returns the source row, a validator that rejects output when a required field is empty, a human reviewer for anything customer-visible or irreversible. The model saying “this is correct” is not verification. If prompt rewrite N+1 hasn’t worked, rewrite N+2 won’t either.
Hallucinations shrink with better formats. They don't go to zero. Quote-grounding — requiring the model to cite the exact passage it's relying on before concluding — is in Anthropic's long-context playbook. It was part of what fixed the scrape; across my other runs it lowers the hallucination rate but never eliminates it. Treat it as a reduction, not a guarantee.
Connecting the dots: AGI is not coming
Step back from the migration guide and the same shift shows up everywhere.
The industry stopped chasing AGI and started chasing reliability. TechCrunch’s 2026 outlook: “the focus is already shifting away from building ever-larger language models and toward the harder work of making AI usable.” IBM’s Peter Staar in the same window: “People are getting tired of scaling and are looking for new ideas.” AT&T’s chief data officer, on what mature enterprises will run in 2026: “fine-tuned SLMs.” Smaller models that do one job predictably, not bigger models that do everything approximately.
Reliability over cleverness has a specific shape. Pawel Huryn, summarizing where agents got useful in late 2025: “agents stopped being toys” once “hallucinations dropped and planning, instruction following, tool use, evals, guardrails, and recovery matured.” Instruction following sits in the middle of that list — not a side feature, the load-bearing one. A model that guesses what you meant can’t be trusted to run 50 steps unattended. A model that does what you wrote can.
4.7 is one data point on that curve. Your prompts will need to survive the rest of it too.
I actually enjoy using Opus now — it really is the best model for the next few days.
The next release will land, something will break, the fix will be the same: read what changed, rewrite what you’d left implicit, keep going.
Thank you to my founding members — Kostas Nasis, Artem Krivonos, and Kristina Hananeina. Your early support made this newsletter real.