LangChain jumped from outside the top 30 to rank 5 on TerminalBench 2.0 — a 13.7-point improvement — without changing the model. The only changes were infrastructure: LocalContextMiddleware for environment mapping, LoopDetectionMiddleware for tracking per-file edit counts, and self-verification loops.
The same week, Can Boluk improved the coding performance of 15 LLMs in one afternoon by changing only the edit format. Grok went from 6.7% to 68.3%. Cost: about $300. His observation: "+8% improvement in Gemini is bigger than most model upgrades deliver, and it cost zero training compute."
The harness — the complete infrastructure wrapping the model — determines whether an agent works in production. Not the model.
What a harness actually is
Vivek Trivedy at LangChain gave it a name: "If you're not the model, you're the harness." The harness is every piece of code, configuration, and execution logic that isn't the model itself: orchestration loop, tools, memory, context management, state persistence, error handling, guardrails.
Beren Millidge built the analogy in 2023: the LLM is the CPU, the context window is RAM, vector databases are disk, plugins are I/O drivers, and the harness is the operating system. It's a good analogy because it captures the essential point — a CPU without an OS is useless.
Birgitta Böckeler formalized the engineering on Martin Fowler's site in April 2026: feedforward controls (guides that steer before the agent acts) vs feedback controls (sensors that observe and correct after). Computational checks (tests, linters — fast, deterministic) vs inferential checks (LLM-as-judge — slow, semantic). These aren't suggestions; they're the control theory underneath every production agent.
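The cost ordering falls out naturally in code. Here's a minimal sketch of the feedback side of that taxonomy — deterministic checks run first because they're cheap, and the expensive LLM-as-judge pass only runs if they succeed. The function names and the check signature are my own illustration, not Böckeler's notation:

```python
from typing import Callable

# A check takes the artifact under review and returns (ok, message).
Check = Callable[[str], tuple[bool, str]]

def run_checks(artifact: str,
               computational: list[Check],
               inferential: list[Check]) -> tuple[bool, str]:
    """Order feedback controls by cost: deterministic checks (tests,
    linters) first, semantic LLM-as-judge checks only if those pass."""
    for check in computational:
        ok, msg = check(artifact)
        if not ok:
            return False, msg  # fail fast on the cheap signal
    for check in inferential:
        ok, msg = check(artifact)
        if not ok:
            return False, msg
    return True, "all checks passed"
```

In practice the computational list wraps subprocess calls to a linter and test runner, and the inferential list wraps a model call; the structure above is the part that generalizes.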
My framing: the model is 20% of the work. The other 80% is the architecture between the model and the user. That 80% is the harness.
The file system is the memory
Here's the pattern I keep seeing in production agents: they don't use vector databases for working memory. They use the file system.
Claude Code's memory is three tiers of files. A lightweight MEMORY.md index with ~150 characters per entry, always loaded. Detailed topic files pulled on demand. Raw transcripts for search-only access. No embeddings, no similarity search, no vector store. Just files.
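The three tiers translate to very little code. This sketch assumes a hypothetical `memory/` layout (only `MEMORY.md` is named in the description above; the `topics/` and `transcripts/` directories are my own stand-ins):

```python
from pathlib import Path

MEMORY_DIR = Path("memory")  # hypothetical root; layout is illustrative

def load_index() -> str:
    """Tier 1: the lightweight index, always placed in context."""
    return (MEMORY_DIR / "MEMORY.md").read_text()

def load_topic(topic: str) -> str:
    """Tier 2: a detailed topic file, pulled only when the index points to it."""
    return (MEMORY_DIR / "topics" / f"{topic}.md").read_text()

def search_transcripts(term: str) -> list[str]:
    """Tier 3: raw transcripts, search-only -- never loaded wholesale."""
    hits = []
    for path in sorted((MEMORY_DIR / "transcripts").glob("*.txt")):
        for line in path.read_text().splitlines():
            if term in line:
                hits.append(f"{path.name}: {line.strip()}")
    return hits
```

The point is what's absent: no embedding model, no index build step, no retrieval service. Tier 3 is just string search over files.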
Manus's todo.md pattern is even more revealing. They constantly rewrite the task list so it sits at the end of the context window — exploiting the transformer's recency bias. The most recent tokens get the most attention. By keeping the active plan in a file that's always re-read, the agent never loses track of what it's doing. The file system exploits an architectural property of the model itself.
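Mechanically, the pattern is just "re-read the file every turn and append it last." A sketch, with function names of my own invention:

```python
def build_context(system_prompt: str, history: list[str],
                  todo_path: str = "todo.md") -> str:
    """Re-read the task list every turn and append it LAST, so the
    active plan sits in the most recent tokens -- the region the
    transformer attends to most strongly."""
    with open(todo_path) as f:
        todo = f.read()
    return "\n\n".join([system_prompt, *history, "Current plan:\n" + todo])

def check_off(item: str, todo_path: str = "todo.md") -> None:
    """Rewrite the file when an item completes; the next turn's
    context automatically reflects the updated plan."""
    with open(todo_path) as f:
        lines = f.read().splitlines()
    lines = [l.replace(f"[ ] {item}", f"[x] {item}") for l in lines]
    with open(todo_path, "w") as f:
        f.write("\n".join(lines))
```

No state lives in the agent process itself; the file is the source of truth, and the context builder is stateless.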
The Ralph Loop pattern from Anthropic's engineering blog shows this at session scale: between sessions, the agent reads claude-progress.txt and feature_list.json from disk, checks git logs, picks the next unchecked item, implements it, commits, writes a summary. The file system provides continuity that context windows can't.
And when Anthropic built a C compiler with 16 parallel agents, they coordinated by writing lock files to a current_tasks/ directory. No distributed coordination protocol. Just files and git push/pull. 100,000 lines of Rust, ~2,000 sessions, about $20,000.
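The lock-file trick works because file creation can be made atomic. A sketch of the claim step (the directory name follows the description above; the claim function and agent-id convention are my own illustration):

```python
import os

def try_claim(task_id: str, agent_id: str,
              lock_dir: str = "current_tasks") -> bool:
    """Claim a task by atomically creating its lock file.
    O_CREAT|O_EXCL fails if the file already exists, so two agents
    racing for the same task can't both win -- no coordination
    service needed, just the file system (and git push/pull to sync
    the directory between machines)."""
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f"{task_id}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent already holds this task
    with os.fdopen(fd, "w") as f:
        f.write(agent_id)  # record the owner for debugging
    return True
```

Releasing a task is just deleting the file. Every property a coordination protocol would provide — mutual exclusion, ownership, visibility — falls out of POSIX semantics.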
The file system is the one interface all coding agents share. It's unlimited storage, directly operable, and it persists across sessions. This is why coding agent patterns keep winning for complex tasks — the file system IS the memory layer.
The tool paradox
More tools make agents worse, not better. This sounds wrong until you see the data.
Vercel's text-to-SQL agent had 15 specialized tools: GetEntityJoins, LoadCatalog, RecallContext, SearchSchema, ClarifyIntent, and 11 more. They deleted 80% and replaced everything with two capabilities: ExecuteCommand (bash) and ExecuteSQL. Accuracy went from 80% to 100%. Speed improved 3.5x. Tokens dropped 37%. Steps dropped 42%.
Their explanation: "We were constraining reasoning because we didn't trust the model to reason."
GitHub found the same pattern from the other direction. They reduced default tools from ~40 to 13 core tools and built an embedding-based routing system. Embedding-guided selection: 94.5% tool use coverage. Static default list: 69%.
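The shape of embedding-guided selection is simple: embed the request, embed each tool description, expose only the top matches. A toy sketch — the tool catalog is invented, and a bag-of-words vector stands in for a real embedding model:

```python
import math
from collections import Counter

TOOLS = {  # hypothetical tool name -> description
    "execute_sql": "run a sql query against the database",
    "read_file": "read the contents of a file from disk",
    "search_code": "search the repository for a code pattern",
}

def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query: str, k: int = 2) -> list[str]:
    """Expose only the k most relevant tools instead of the full catalog."""
    q = embed(query)
    ranked = sorted(TOOLS, key=lambda t: cosine(q, embed(TOOLS[t])), reverse=True)
    return ranked[:k]
```

Swap `embed` for a real embedding API and this is the whole routing layer: the model never sees the tools it doesn't need, so its context stays reserved for reasoning.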
Manus's approach is the most sophisticated: instead of dynamically adding and removing tools (which breaks the KV cache), they mask token logits during decoding. Tool definitions stay in context to preserve cache hits, but the model is constrained at the token level from selecting inappropriate tools. Action names use consistent prefixes (browser_, shell_) to enable state-based masking.
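A minimal sketch of state-based masking. Real logit masking happens per token inside the decoder; this approximates it at the whole-name level, and the state machine and tool names are illustrative assumptions, not Manus internals:

```python
TOOLS = ["browser_open", "browser_click", "shell_exec", "shell_read"]

STATE_PREFIXES = {            # hypothetical agent states
    "browsing": ("browser_",),
    "terminal": ("shell_",),
    "any":      ("browser_", "shell_"),
}

def mask(state: str) -> dict[str, bool]:
    """True = the logits for this tool's name tokens are left alone;
    False = masked to -inf during decoding. Tool definitions never
    leave the context, so the KV cache stays valid."""
    prefixes = STATE_PREFIXES[state]
    return {tool: tool.startswith(prefixes) for tool in TOOLS}
```

The consistent naming prefixes do real work here: one prefix check constrains an entire family of tools, so the mask stays cheap to compute on every decode step.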
The lesson connects directly to the MCP token cost problem I wrote about earlier. Every tool definition competes with your agent's reasoning space. More tools, worse thinking.
Sandboxing is the enabler
Agents can safely execute code because sandbox technology got fast enough.
Firecracker microVMs boot in ~125ms from cold, but 28ms from snapshot via copy-on-write memory overlays. ~5MB overhead per VM. Each gets its own Linux kernel — kernel exploits inside one VM can't reach the host or other VMs. Only 4 emulated devices: virtio-block, virtio-net, serial console, power button. That's the entire attack surface.
Vercel Sandbox runs on this: Firecracker + Amazon Linux 2023, up to 8 vCPUs, billed only for active CPU time. Codex takes a different approach: no VM at all, OS-native sandboxing via Bubblewrap + seccomp on Linux. Read-only root with specific writable bind mounts. They deliberately do NOT block recvfrom because cargo clippy uses socketpair for subprocess management — a careful security/compatibility tradeoff that shows how deep the engineering goes.
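The OS-native style is the more approachable of the two to sketch. This builds a Bubblewrap invocation in the same spirit — read-only root, one writable bind mount, no network. The flags are standard bwrap options, but the specific policy is an illustrative assumption, not Codex's actual configuration:

```python
def bwrap_command(workdir: str, cmd: list[str]) -> list[str]:
    """Construct (but don't run) a sandboxed command line:
    read-only root filesystem, a single writable path, no network."""
    return [
        "bwrap",
        "--ro-bind", "/", "/",        # read-only view of the root fs
        "--bind", workdir, workdir,   # the only writable path
        "--unshare-net",              # drop network access entirely
        "--die-with-parent",          # kill the sandbox if the agent dies
        "--",
        *cmd,
    ]
```

Pass the result to `subprocess.run` on a Linux host with bwrap installed. A production harness layers a seccomp filter on top, which is where tradeoffs like the `recvfrom` exception above get encoded.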
Claude Code's auto mode adds another layer: a Sonnet 4.6-based safety classifier that is deliberately "reasoning-blind." It receives only user messages and tool calls — never Claude's own reasoning or outputs. This prevents the agent from persuading its own safety system through self-generated justifications. Two-stage pipeline: fast single-token filter (8.5% false positive rate), then chain-of-thought reasoning on flagged actions (0.4% false positive rate).
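The structural shape is worth separating from the models: a cheap filter over every action, an expensive pass only over what the filter flags. In this sketch both stages are crude stubs of my own invention — the point is the pipeline and the deliberately narrow input (user messages and tool calls only, never the agent's reasoning):

```python
def fast_filter(tool_call: str) -> bool:
    """Stage 1: cheap check over every action. High recall, higher
    false positives. (Illustrative heuristics, not the real classifier.)"""
    risky_markers = ("rm -rf", "curl", "sudo")
    return any(m in tool_call for m in risky_markers)

def reasoned_check(user_msgs: list[str], tool_call: str) -> bool:
    """Stage 2: in production, an LLM chain-of-thought pass. Stubbed:
    block only if nothing in the user's own requests resembles the call.
    Note the inputs: user messages, never the agent's justifications."""
    joined = " ".join(user_msgs)
    return not any(word in joined for word in tool_call.split())

def is_blocked(user_msgs: list[str], tool_call: str) -> bool:
    if not fast_filter(tool_call):
        return False  # most actions exit at the cheap stage
    return reasoned_check(user_msgs, tool_call)
```

Keeping the agent's own reasoning out of stage 2's inputs is the reasoning-blind property: the agent cannot talk its safety system into anything, because the safety system never hears it.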
The Kubernetes ecosystem is catching up. kubernetes-sigs/agent-sandbox provides a declarative CRD for stateful agent workloads: stable identity, persistent storage, SandboxWarmPool for pre-warmed pods.
Who's paying for harnesses
If the harness is just infrastructure, why is it worth billions?
Meta acquired Manus for ~$2 billion in December 2025 — not for the model (they have Llama), but for the harness. As one analysis put it: "99% of the value is in getting the job done, not starting it."
OpenAI's million-line experiment: ~1 million lines of code across 1,500 pull requests, zero lines written by human hands. The humans were "designing the environment that made reliable code generation possible." The harness was the product. The code was the output.
Stripe's Minions ship 1,300 PRs per week on pre-warmed EC2 devboxes that spin up in 10 seconds through proactive provisioning. Each agent accesses ~500 tools via their centralized MCP server (Toolshed), but receives an intentionally small subset. Blueprints mix deterministic nodes (linters, test selection from 3M+ tests) with agentic nodes (implementation, CI failure fixing). The infrastructure Stripe built for human engineers over years of tooling investment is what makes 1,300 AI PRs/week possible.
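The blueprint idea — deterministic and agentic nodes in one pipeline — can be sketched as an ordered list of functions, where deterministic nodes are plain code and agentic nodes wrap a model call. Node names and structure here are illustrative assumptions, not Stripe internals:

```python
from typing import Callable

Node = tuple[str, Callable[[dict], dict]]

def make_blueprint(model_call: Callable[[dict], str]) -> list[Node]:
    """Mix deterministic nodes (predictable, testable) with agentic
    nodes (the stubbed model_call) in one ordered pipeline."""
    return [
        ("lint",         lambda ctx: {**ctx, "lint_ok": True}),           # deterministic
        ("select_tests", lambda ctx: {**ctx, "tests": ["test_auth"]}),    # deterministic
        ("implement",    lambda ctx: {**ctx, "patch": model_call(ctx)}),  # agentic
        ("run_ci",       lambda ctx: {**ctx, "ci_ok": bool(ctx["patch"])}),  # deterministic
    ]

def run_blueprint(blueprint: list[Node], ctx: dict) -> dict:
    for _name, node in blueprint:
        ctx = node(ctx)  # each node reads and extends the shared context
    return ctx
```

The design choice is that the model only occupies the nodes where judgment is needed; everything that can be deterministic stays deterministic, which is what keeps 1,300 PRs a week reviewable.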
And Anthropic's own harness comparison: a solo run (single agent, no harness) cost $9 in 20 minutes but produced a broken game — "entities appeared on screen but nothing responded to input." The full harness cost $200 over 6 hours but delivered a functional application. The harness was the difference between a demo and a product.
The counterexamples
Not every agent needs to write code.
No-code agent platforms (n8n, Lindy, Konverso) are thriving with significant venture funding. Simple chat, scheduling, FAQ — these work through API orchestration and structured tool calls without general code execution.
Travel agents like the Sabre/PayPal/MindTrip pipeline orchestrate APIs without writing code. They're complex but bounded.
The APEX-Agents benchmark is telling: frontier models achieve only 24% on professional tasks despite 90%+ on traditional benchmarks. The failures "were not knowledge failures" but "orchestration problems" — locating files, interpreting notes, resolving ambiguity. These are infrastructure problems, not model problems. But they don't all require general code execution to solve.
My honest framing: every agent that outgrows a demo converges on the same infrastructure — file system, shell, verification loops, state persistence. Not every agent outgrows the demo. The question is whether yours needs to.
Design for deletion
Here's the paradox: the harness IS the product, but it should get thinner over time.
Anthropic's harness design paper: "Every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing — both because they may be incorrect, and because they can quickly go stale as models improve." With Opus 4.6, they eliminated sprint decomposition, moved evaluation to single pass, and dropped context resets entirely.
Manus rebuilt their harness five times in six months, each time removing user-facing complexity while adding targeted internal infrastructure. KV-cache hit rate became "the single most important metric for a production-stage AI agent."
OpenAI's Codex team put it bluntly: "If you rely on complex scaffolding to build AI agents you aren't scaling, you are coping."
And the Meta-Harness project proved the logical endpoint: an LLM optimizing its own infrastructure achieved 76.4% on TerminalBench, surpassing hand-designed harnesses. The agent wrote its own harness. The agent then simplified its own harness.
The next time your agent fails, don't blame the model. Look at the harness.