I rebuilt my AI agent 10 times. Spoiler: it was never about the cutting-edge model.
How harness engineering turned my failing system around.
The next AI model won’t fix your agents. I know because I used the same Claude for 10 versions of the same system. What actually changed my results was everything around the model — the instructions it reads, the tools it can use, the checks that catch its mistakes.
Everything you build around the model — the instructions, the tools, the error handling, the pipeline — developers call this the harness. Mitchell Hashimoto, who built Terraform, gave it a simple rule: every time your agent makes a mistake, engineer a fix so it never makes that mistake again.
All of this research comes from the engineering world. But I applied the same principles to my trendwatching agent — and it brought me 85K impressions on LinkedIn in 3 days by finding the hottest trends first.
The Context
(skip if you want to apply harness asap)
When there’s no marketing budget, you ARE the marketing. For me that meant LinkedIn — and LinkedIn meant knowing what’s trending before everyone else. That was 3 hours of my morning, every morning. I tried to automate it. Ten times.
The first six versions were a slow education in what doesn’t work. I started by dumping everything into one Claude conversation — worked until the context window filled up and the model forgot what it had already read. Then I vibe-coded Python scripts with a scoring system, but spent more time calibrating scores than the system was saving me. Then I added 9+ sources across languages and the volume drowned the model in noise — hallucinations, missed items, everything still crammed into one agent’s context.
Version 7 broke me. I tried giving every single content piece its own dedicated agent. Three hours of compute that never resolved. Then the opposite — blind parallel pipelines with no monitoring. 45 minutes, garbage output. Both extremes failed for the same reason: I was engineering around the wrong problem.
Versions 9 and 10 were a full rebuild. Monitored background agents instead of blind subprocesses. Evals at every phase with fallback instructions for when things break. An orchestrator that takes over when subagents fail. The specifics are the six principles below.
Runtime went from 45 minutes to 15, reliability went from “maybe works” to something I’d actually stake my LinkedIn on — and it’s still the same Claude.
Here are 6 harness principles I applied:
1. Give the AI less, not more
Version 7 ran for 3 hours and never finished because the agent was doing everything — fetching data, routing APIs, handling files, analyzing, scoring, formatting output. I kept throwing more at Claude because that’s what felt natural — more tools, more responsibility, more context — and three hours later I had nothing.
I split the work — Python fetches, Claude analyzes. The scoring agent sees only the scoring rubric and the raw items, no pipeline logic, no historical context, and scores improved immediately.
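A minimal sketch of that split, assuming hypothetical names (`fetch_items`, `score_items`) and a `call_model` stand-in for whatever LLM client you use: deterministic code does the fetching, and the model only ever sees the rubric plus the raw items.

```python
import json
import urllib.request

def fetch_items(feed_urls):
    """Deterministic side: plain Python fetches raw items; no model involved."""
    items = []
    for url in feed_urls:
        with urllib.request.urlopen(url) as resp:
            items.extend(json.load(resp))
    return items

def score_items(items, call_model):
    """Model side: the scoring agent sees only the rubric and the raw items,
    no pipeline logic, no historical context."""
    prompt = (
        "Score each item 1-10 for trend potential. "
        "Score EVERY item, drop NOTHING.\n\n" + json.dumps(items)
    )
    return call_model(prompt)  # call_model: any LLM client, stubbed here
```

The point of the design is the boundary, not the specifics: the model function is swappable, the fetch code is boring and testable.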
Vercel learned this too. Their text-to-SQL agent had 17 tools. They deleted 15, gave the model bash and file access. 100% accuracy. 3.5x faster. 37% fewer tokens. The model wasn’t confused by the task — it was confused by the choices.
Birgitta Bockeler from Thoughtworks: “Increasing trust and reliability requires constraining the solution space rather than expanding it.”
2. Evals and Plan B
My clusterer received hundreds of scored items, silently dropped most of them, and delivered a report with a handful — acting like everything was fine. I only caught it because the output looked suspiciously thin.
The fix: after every phase, run a check. Did the output contain as many items as the input? Is the JSON valid? Did the scores fall within expected ranges? If any check fails, don’t retry the same agent — have the main system take over and do the work itself.
Three lines of eval logic after each step caught more failures than weeks of prompt tweaking ever did.
Here’s what my evals actually check:
After scoring: item count in = item count out. No drops.
After clustering: every scored item appears in exactly one cluster.
After writing: every item is in the report.
If any check fails, the orchestrator does that step directly. No retries, no second chances for the agent.
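Those checks really do fit in a few lines. A sketch, with `run_phase`, `no_drops`, and `scores_in_range` as hypothetical names, and the 0-10 score range as an assumption:

```python
def run_phase(agent_fn, fallback_fn, checks, items):
    """Run one pipeline phase, validate the output, fall back on failure."""
    output = agent_fn(items)
    if all(check(items, output) for check in checks):
        return output
    # No retries, no second chances: the orchestrator does the step itself.
    return fallback_fn(items)

# Evals mirroring the list above.
no_drops = lambda inp, out: len(out) == len(inp)        # count in == count out
scores_in_range = lambda inp, out: all(                 # assumed 0-10 rubric
    0 <= o.get("score", -1) <= 10 for o in out
)
```

The fallback is the orchestrator's own deterministic implementation of the step, which is why a failed check never loops back to the same agent.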
Stripe’s coding agents enforce the same pattern: can’t fix it in 2 rounds? Stop. Hand to a human. Devin had no such limit — in independent testing, it succeeded just 15% of the time.
3. One agent is usually better than many
I use multiple agents — but only for genuinely parallel work. When I need to score hundreds of items, I split them into batches and run scoring agents in parallel. Each batch is independent — item #47 doesn’t depend on item #248 — so parallelism works.
But the overall pipeline is strictly sequential because each phase depends on the previous one, and when I tried parallelizing phases that shouldn’t be parallel, it broke immediately — the clusterer was reading scores that hadn’t been written yet.
Princeton found single agents beat agent teams 64% of the time. Google tested 180 configurations — multi-agent degrades sequential tasks by up to 70%.
If your work is genuinely independent — split it across agents. If it’s sequential — keep it in one.
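The batch-scoring pattern can be sketched with the standard library; `score_in_parallel` and the batch size are hypothetical, and `score_batch` stands in for one scoring agent call. Only the independent batches run concurrently; the phases around them stay sequential.

```python
from concurrent.futures import ThreadPoolExecutor

def score_in_parallel(items, score_batch, batch_size=50):
    """Split genuinely independent items into batches and score them in
    parallel. map() preserves order, so item k keeps its position."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = pool.map(score_batch, batches)
    return [scored for batch in results for scored in batch]
```

If a later step needed an earlier batch's output, this pattern would be the wrong tool — that dependency is exactly the signal to keep the work in one agent.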
4. Build a filing system, not a memory dump
My early versions crammed everything into one conversation. By version 6, the context window was so full that Claude started forgetting earlier parts of the run — mixing up sources, duplicating scores, losing track of what it had already processed.
Each scoring agent now gets exactly one batch to read and one scored file to write — the orchestrator holds the big picture, but individual agents never see the whole pipeline. Keeping each agent’s context small and focused is what finally made the scoring reliable.
MEMORY LAYERS
Always loaded: CLAUDE.md (the constitution)
On demand: Batch files (one per agent)
Never loaded: Full run history (searched, not read)

The system has a CLAUDE.md file — audience calibration, engagement normalization rules, some secret sauce — loaded at the start of every session. Not the full history — just the constitution. Hashimoto calls this AGENTS.md: every time an agent makes a mistake, the fix becomes a permanent rule in the file. My CLAUDE.md has grown the same way — each line traces back to a bad run.
This scales. 11 out of 12 models drop below 50% accuracy at 32,000 tokens. A well-curated 30K-token context beats an uncurated 120K-token context every time.
5. Kill before build
My biggest deletion: the entire subprocess architecture from version 7. Spawn, wait, hope — no monitoring, no intervention, no visibility. Replaced it with background agents the main system monitors in real time. If an agent fails, the system sees it in seconds and takes over.
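The monitored replacement can be sketched with a future plus a timeout; `run_monitored` and the timeout value are hypothetical, and `fallback_fn` is the orchestrator doing the work itself.

```python
from concurrent.futures import ThreadPoolExecutor

def run_monitored(agent_fn, fallback_fn, payload, timeout_s=60):
    """Run an agent as a monitored background task instead of a blind
    subprocess: if it hangs or crashes, the orchestrator takes over."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(agent_fn, payload)
        try:
            return future.result(timeout=timeout_s)
        except Exception:  # timeout, crash, anything: no spawn-wait-hope
            future.cancel()
            return fallback_fn(payload)
```

The contrast with version 7 is the `result(timeout=...)` call: the orchestrator is watching a deadline instead of hoping a subprocess returns.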
Another one: I tried a cheaper model for scoring to save costs. It produced broken JSON that crashed everything downstream. Switched back to the expensive model for all agents, zero exceptions. Sometimes the “optimization” is the problem.
Manus deleted their todo tracker — it ate 30% of tokens doing nothing. Rentier Digital deleted their memory subsystem — a 2.3-second latency improvement.
Audit your system regularly. If a component isn’t earning its cost — kill it.
6. Every step must earn its place
The compound reliability math: if each phase succeeds 95% of the time, a 10-phase pipeline succeeds only 0.95^10 ≈ 60% of the time.
I started with more phases than I have now — separate validation steps, pre-processing, post-processing, reformatting stages — and each one added marginal quality but cost reliability, so the net was negative. I cut back to the minimum where each phase does real work.
Same inside each phase. My scoring prompt started as a long document with examples, edge cases, scoring histories, anti-patterns. I trimmed everything around the core: the dimensions, the weights, the normalization table, and one strict rule — score EVERY item, drop NOTHING. Shorter prompt, better scores.
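The step tax is easy to compute for your own pipeline; the 95% per-step figure is an assumption, swap in your own estimate.

```python
def reliability_ceiling(steps, per_step=0.95):
    """Theoretical ceiling for a sequential pipeline: failures compound."""
    return per_step ** steps

# Ten phases at 95% each leave a ~60% chance the whole run succeeds;
# cutting to five phases lifts the ceiling to ~77%.
print(round(reliability_ceiling(10), 2))  # 0.6
print(round(reliability_ceiling(5), 2))   # 0.77
```

Every phase you delete multiplies the ceiling back up, which is why a step has to earn more quality than it costs in reliability.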
What I’d do differently on day 1
The counterintuitive lesson from 10 rebuilds: more is less. If you’re reaching for more agents, more tools, more context — it will probably make your system worse, not better.
If I started over tomorrow, I’d build the eval framework before writing a single agent, and I’d recommend you do the same. Every version that failed taught me the same thing: I didn’t know it was failing until it was too late. Evals first, agents second, optimization never.
Phil Schmid said it best: “The harness is the dataset. Competitive advantage is the trajectories your harness captures.” Every failure my system logs makes the next run better. That’s the real compounding — not a better model, but a smarter harness.
The 6-point harness audit
Before you rebuild anything, run this on what you have now. One “no” = the highest-impact fix you can make today.
Less, not more. Is your agent doing tasks that deterministic code could handle? (Fetching, routing, formatting — move to scripts. And check what the model actually sees — if the input is mangled, no prompt will save you.)
Evals and Plan B. What happens when a step fails silently? Do you have an assertion checking output count/quality after each phase?
Single agent default. Are you using multiple agents for work that’s actually sequential? (If step 2 depends on step 1’s output — that’s one agent, not two.)
Scoped context. Does each agent see only what it needs? Or does it get the full conversation history?
Recent deletions. When did you last remove a component that wasn’t earning its cost?
Step tax. How many steps in your pipeline? Multiply: 0.95^N = your theoretical reliability ceiling. Is every step worth the tax?
What’s the thing you kept trying to fix with a better model before you realized the model wasn’t the problem?
Thank you to my founding members — Kostas Nasis, Artem Krivonos, and Kristina Hananeina. Your early support made this newsletter real.