Harness Design for Long-Running AI Agents

The missing layer most teams skip.

Over the past year, building production agentic systems for clients across fintech, legal, and compliance, I've watched the same failure play out more times than I'd like to admit. The demo works. Staging works. The system ships. Then, somewhere around the 40-minute mark on a task, the agent starts doing something nobody expected: repeating work it already finished, contradicting decisions it made earlier in the run, producing output that's confidently, obviously wrong. Nothing crashed. No error in the logs. The model is still running. It's just gone off the rails.

Every time this happens, the instinct is to blame the model. Swap the prompt. Try a different temperature. Upgrade to the next version. Sometimes that helps at the margin. But in almost every case I've diagnosed, the real problem was the same thing: the harness wasn't designed.

The harness is everything around the model: the scaffolding that manages context, coordinates agents, handles failures, and keeps the system coherent over time. In my experience, teams spend weeks on prompt engineering and almost nothing on harness design. That's where most long-running agentic systems fail, and it's where most of the interesting engineering work actually lives.

What a harness is, and what it isn't

The harness is not the model. It's the system that makes the model useful in production. It covers how context is managed as tasks grow longer, how work gets decomposed into steps the agent can execute reliably, how failures are detected and handled, how multiple agents coordinate, how state moves between sessions, and how outputs get evaluated before anything acts on them.

I want to draw a distinction here that matters in practice. Context engineering, deciding what goes into the context window at each step, is one piece of harness design, and it's gotten a lot of attention recently. But harness design is broader. You can have excellent context engineering and still ship a system that collapses at production volume because nobody designed the evaluation architecture, or defined what happens when a step fails twice in a row, or thought about how state survives a restart. I've inherited systems like this. The context was beautifully managed. Everything else was held together with assumptions that turned out to be load-bearing walls.

The analogy I use: in traditional software, you wouldn't ship a service without error handling, retry logic, and observability. Nobody calls those features; they're infrastructure. Harness design is the same thing for agentic systems. The reason teams skip it is the same reason teams skip any infrastructure work early in a project: it doesn't matter until it does, and by the time it matters, you're already in production.

Four failure modes I've seen in production

After enough of these retrospectives, the failure modes start to look familiar. Here are the four I've hit most often.

Context pollution

Context pollution is the gradual kind. As the agent works through a long task, the context window fills with earlier steps, tool outputs, intermediate decisions, and accumulated state. The model is working with all of it simultaneously, and at some point the weight of that history starts to degrade coherence. The agent doesn't crash; it gets progressively worse. Later decisions contradict earlier constraints. The overall goal drifts. The outputs look reasonable in isolation and fall apart in aggregate. It's the kind of failure that's hard to catch in testing because short tasks don't surface it.

Context anxiety

Context anxiety is less well-documented, but I've seen it consistently. Some models, as they approach what they perceive as their context limit, start wrapping up work prematurely. They produce a final output, sometimes clearly incomplete, as if they're trying to finish before they run out of room. The system reports success. The output looks like a completion. Only when a human actually reads it does it become clear that half the work was skipped. In one compliance system we built on Claude Sonnet 4.5, this behavior was pronounced enough that context resets (not compaction, but full resets with structured handoffs) became a core part of the harness. Compaction keeps the same agent running on a shortened history. That addresses context pollution, but it doesn't give the agent a clean slate, so the anxiety behavior persisted. Only a genuine reset resolved it.

Self-evaluation bias

Self-evaluation bias is structural, not model-specific. Ask an agent to assess the quality of its own output and it will almost always tell you the output is good, even when, to a human reviewer, it's obviously mediocre. This is particularly sharp on subjective tasks, but it shows up on verifiable ones too. In a compliance monitoring system processing sales calls, the evaluator, initially the same agent as the processor, was accepting reports that any human reviewer would flag immediately. What changed things wasn't tuning the processor. It was separating the evaluation into a completely different agent with its own context, its own instructions, and explicit incentives to find problems. That agent was still an LLM with the same general tendency toward generosity, but tuning a standalone skeptical evaluator turned out to be far more tractable than making a generator critical of its own work.

State loss between sessions

State loss between sessions is the fourth, and often the most expensive when it hits. Many production agentic systems don't run in a single continuous session. They pause, hand off, and restart, because they're long, or because something failed, or because they're running in parallel across multiple instances. If the handoff artifact isn't designed carefully, the next session starts without a complete picture of what's been done, what decisions were made, and what constraints apply. In a fintech system processing loan applications, we saw agents duplicating work that had already been completed and contradicting decisions from earlier sessions, not because the model was confused, but because the state it was handed was incomplete. The fix wasn't a better model. It was a better handoff artifact.

The five patterns that address them

These aren't framework-specific. They're architectural decisions that apply regardless of what tooling you're using, and they're worth making explicitly before you build, not after you've already hit the failure mode in production.

Context resets with structured handoffs

Context resets with structured handoffs address pollution, anxiety, and state loss together. The approach is to design the system to reset the context at defined intervals and hand off structured state to a fresh agent instance, rather than letting context grow indefinitely or relying on compaction alone. The handoff artifact is where most of the design work lives. It needs to carry four things: what has been completed (with enough detail that the next agent doesn't redo it), what decisions were made and why (so the next agent doesn't contradict them), what comes next (the specific next step, not a general description), and the current state of any shared artifacts (files, databases, git history, whatever the task touches). The reset gives the next agent a clean context window. The handoff artifact gives it everything it needs to continue coherently. This adds orchestration complexity and token overhead, which is a real cost. But on tasks where context anxiety is pronounced, it's the only intervention I've seen actually work.
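
To make the shape of a handoff artifact concrete, here is a minimal sketch in Python, assuming the artifact is serialized as JSON between agent instances. The class and field names (Handoff, Decision, completed, next_step, artifact_state), the should_reset helper, and its 0.6 threshold are all hypothetical choices for illustration, not a schema from any framework or from the systems described here. The example values are placeholders.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class Decision:
    """A decision plus its rationale, so the next agent does not contradict it."""
    what: str
    why: str


@dataclass
class Handoff:
    """Structured state passed to a fresh agent instance after a context reset."""
    completed: list[str] = field(default_factory=list)
    decisions: list[Decision] = field(default_factory=list)
    next_step: str = ""
    artifact_state: dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, raw: str) -> "Handoff":
        data = json.loads(raw)
        data["decisions"] = [Decision(**d) for d in data["decisions"]]
        return cls(**data)


def should_reset(tokens_used: int, context_limit: int, threshold: float = 0.6) -> bool:
    """Trigger a reset well before the limit; 0.6 is an assumed starting point."""
    return tokens_used >= threshold * context_limit


# Placeholder values, for illustration only.
handoff = Handoff(
    completed=["Extracted applicant income from uploaded statements"],
    decisions=[Decision("Use gross income", "Policy requires pre-tax figures")],
    next_step="Validate the debt-to-income ratio against the policy threshold",
    artifact_state={"git_commit": "abc123", "db_table": "applications_staging"},
)
restored = Handoff.from_json(handoff.to_json())
```

The round trip through JSON is the point: the fresh agent receives only what the artifact carries, so anything missing from it is effectively forgotten.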

Separating the generator from the evaluator

Separating the generator from the evaluator addresses self-evaluation bias. The principle is simple: never rely on the same agent to assess the quality of its own output for anything that matters. The evaluator needs its own context window, one that didn't participate in building the output. It needs clear, specific criteria to grade against, not "is this good?" but "does this meet the defined acceptance criteria?" And it needs the tools to verify the output directly: for code, that means running it; for documents, reading them; for APIs, calling them. In the compliance system I mentioned, the evaluator used Playwright to navigate the actual running application the way a user would before scoring each criterion. That interaction with the live system, not a static review, was what made the evaluations meaningful. A skeptical evaluator with those capabilities is far more effective than any amount of self-critique from the generator.
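
A minimal sketch of a criteria-driven evaluator, assuming each acceptance criterion can be expressed as a check that takes only the finished output. The names Criterion, Verdict, and evaluate are hypothetical, and the string checks below are stand-ins: in a real system each check might run code, call an API, or invoke a separate model with its own context.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion with a direct check against the output."""
    name: str
    check: Callable[[str], bool]


@dataclass(frozen=True)
class Verdict:
    passed: bool
    failures: list[str]


def evaluate(output: str, criteria: list[Criterion]) -> Verdict:
    """Grade the output against each criterion.

    The evaluator receives only the finished output and the criteria,
    never the generator's context or reasoning.
    """
    failures = [c.name for c in criteria if not c.check(output)]
    return Verdict(passed=not failures, failures=failures)


# Stand-in checks for illustration; real checks would verify the output directly.
criteria = [
    Criterion("cites a call id", lambda text: "call_id=" in text),
    Criterion("states a verdict", lambda text: "verdict:" in text.lower()),
]
verdict = evaluate("call_id=17 looks fine to me", criteria)
```

Returning the list of failed criteria, rather than a single score, gives the generator something specific to fix on the next attempt.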

Sprint-based decomposition with pre-defined contracts

Sprint-based decomposition with pre-defined contracts addresses pollution, bias, and state loss simultaneously. The idea is to break long tasks into bounded units of work with a defined contract for each: what will be built, how success will be verified, and what the handoff to the next sprint looks like. Critically, the contract is negotiated before any work starts, agreed between the generator and the evaluator before a line of code is written. This eliminates the most common source of wasted work: building the right thing the wrong way, or the wrong thing entirely, and only discovering it at evaluation time. It also limits the blast radius of any single failure. If sprint 4 produces mediocre output, you reset from the sprint 3 handoff, not from the beginning.
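
The contract and the "reset from the last good handoff" behavior can be sketched as follows. This is an illustration under assumptions: SprintContract, run_sprints, and the execute and verify callables are hypothetical names, and the state here is a plain dict standing in for a real handoff artifact.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SprintContract:
    """Agreed before work starts: what gets built and how it is verified."""
    name: str
    deliverable: str
    acceptance_checks: list[str]


def run_sprints(
    contracts: list[SprintContract],
    execute: Callable[[SprintContract, dict], dict],
    verify: Callable[[SprintContract, dict], bool],
    initial_state: dict,
) -> tuple[dict, Optional[str]]:
    """Run sprints in order and keep the last verified state.

    Returns (last_good_state, failed_sprint_name); the name is None
    when every sprint passed verification.
    """
    state = initial_state
    for contract in contracts:
        candidate = execute(contract, state)
        if not verify(contract, candidate):
            return state, contract.name  # restart from the previous handoff
        state = candidate
    return state, None


contracts = [
    SprintContract(f"sprint-{i}", f"deliverable {i}", ["tests pass"])
    for i in (1, 2, 3, 4)
]


def execute(contract: SprintContract, state: dict) -> dict:
    # Build a new state rather than mutating the last good one.
    return {"done": state["done"] + [contract.name]}


def verify(contract: SprintContract, candidate: dict) -> bool:
    return contract.name != "sprint-4"  # simulate a failed evaluation


last_good, failed = run_sprints(contracts, execute, verify, {"done": []})
```

Because execute never mutates the previous state, a failed sprint leaves the last verified handoff untouched, which is what keeps the blast radius to one sprint.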

Explicit failure handling and escalation paths

Explicit failure handling and escalation paths are the infrastructure layer most agentic systems are missing. The questions to answer before you ship: what triggers a retry versus a fallback versus a human escalation? What state is preserved across retries? When a sprint fails evaluation, does the generator get one more attempt or does it escalate after the first failure? What does human-in-the-loop actually look like for this system: a synchronous review, an async notification, a dashboard flag? These decisions feel premature when you're building. They become urgent at 2am when the system is running a critical task, something unexpected happens, and there's no defined path forward.
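
One way to force these decisions early is to write them down as an explicit policy. The sketch below is illustrative only: the failure kinds ("transient", "evaluation"), the FailurePolicy fields, and the default of one retry are assumptions, not rules taken from the systems described in this post.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FALLBACK = "fallback"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class FailurePolicy:
    max_retries: int = 1        # assumed default: one extra attempt
    has_fallback: bool = False  # e.g. a simpler path or a different model


def next_action(kind: str, failed_attempts: int, policy: FailurePolicy) -> Action:
    """Decide what happens after a failure.

    kind is "transient" (timeouts, rate limits) or "evaluation"
    (the output failed review); anything else is treated as unknown.
    failed_attempts counts failures so far, including the current one.
    """
    if kind not in ("transient", "evaluation"):
        return Action.ESCALATE  # unknown failures go straight to a human
    if failed_attempts <= policy.max_retries:
        return Action.RETRY
    if policy.has_fallback:
        return Action.FALLBACK
    return Action.ESCALATE
```

For example, with the default policy a first evaluation failure yields a retry and a second one escalates, which answers the "one more attempt or escalate" question in code rather than in someone's head.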

Observability as a first-class requirement

Observability as a first-class requirement is what makes all the other patterns debuggable. That means structured logging that captures not just errors but the agent's reasoning at key decision points, the content of handoff artifacts, evaluator scores over time, and the rate of sprint failures and retries. Without it, debugging a long-running agent failure means reading thousands of lines of raw output to reconstruct what happened. With it, you can see exactly where coherence started to degrade, whether evaluator scores were trending in the wrong direction before a failure, and which failure modes you're actually hitting. In the compliance system, adding this layer was what let us trace the self-evaluation bias back to its source: the evaluator's scores had been drifting positive for three sprints before the drop in output quality became visible to a human reviewer.
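
A minimal sketch of both halves, using only the Python standard library: one JSON line per harness event, plus a simple check for evaluator scores drifting upward. The function names, the event fields, the three-sprint window, and the 0.05 tolerance are assumptions chosen for illustration; the scores are made-up sample values.

```python
from __future__ import annotations

import json
import logging
from statistics import mean

logger = logging.getLogger("harness")


def log_event(event: str, **fields) -> str:
    """Emit one JSON line per harness event and return it for inspection."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line


def scores_drifting_up(
    scores: list[float], window: int = 3, tolerance: float = 0.05
) -> bool:
    """Flag when the mean of the last `window` scores exceeds the mean of
    all earlier scores by more than `tolerance`.

    Needs at least 2 * window scores so the baseline is not trivially small.
    """
    if len(scores) < 2 * window:
        return False
    recent = mean(scores[-window:])
    baseline = mean(scores[:-window])
    return recent - baseline > tolerance


# Made-up sample values for illustration.
line = log_event("sprint_evaluated", sprint=3, score=0.8, retries=1)
drifting = scores_drifting_up([0.70, 0.72, 0.71, 0.80, 0.85, 0.90])
```

A rising score is not a problem by itself; the signal worth alerting on is scores rising while human spot checks stay flat or get worse.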

When to invest in this, and when to skip it

Not every system needs a full harness, and recommending one universally would be the same mistake as skipping it entirely.

You need it when the agent runs for more than 10 to 15 minutes on a single task, when the task involves more than five or six sequential steps, when multiple instances need to coordinate or hand off state, when the output goes into production without human review of every detail, or when the cost of a wrong output is significant: compliance, financial, medical, legal.

You can defer it for single-turn or short-session agents (Q&A, extraction, classification), for internal tools with low-stakes outputs where human review happens naturally, and for early PoC work where you're validating feasibility, not building for production.

The test I use: if a human doing the same task would need to take notes to track their own progress, the agent needs a harness. If the task fits in one sitting without notes, a simpler design is probably sufficient.

One thing I've learned the hard way: harness design is significantly harder to retrofit than to design upfront. The data structures, agent boundaries, and evaluation checkpoints that a good harness requires tend to shape the whole system. Building around assumptions that ignore them means those assumptions become load-bearing walls. Build a minimal harness from the start. It's far easier to expand it than to add one to a system that wasn't designed for it.

What it looks like when it works

The compliance monitoring system I've referenced throughout this post is the clearest example I have of what harness design actually changes in production. The system processes sales calls end-to-end: it extracts structured compliance data, validates it against regulatory requirements, and flags violations, with multiple agent instances running in parallel at volume.

Before we designed the harness properly, the system had three of the four failure modes I described. Context pollution from parallel runs bleeding into each other through shared state was producing inconsistent reports for calls with similar content. The evaluator, the same agent as the processor, was accepting reports that any human reviewer would immediately flag. And when the system restarted after a failure, some calls were processed twice while others were skipped entirely, with no recovery path.

The harness changes were not complicated. We isolated context per call, eliminating shared state between parallel runs. We built a separate evaluator agent tuned specifically to the criteria of compliance review, not a general evaluator, but one calibrated to the exact standards the compliance team applied. And we moved job state tracking to an external store, so the harness always knew which calls had been processed, which were in progress, and which needed to be retried.
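
As an illustration of external job state tracking, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the status values, and the function names are assumptions for the example; the post does not say which store the production system used, and a shared database or queue would be the likelier choice at volume.

```python
from __future__ import annotations

import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "call_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def enqueue(conn: sqlite3.Connection, call_id: str) -> None:
    """Register a call once; re-enqueueing an existing call is a no-op."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO jobs VALUES (?, 'pending')", (call_id,))


def claim(conn: sqlite3.Connection, call_id: str) -> bool:
    """Move pending -> in_progress; False if the call is not pending."""
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status = 'in_progress' "
            "WHERE call_id = ? AND status = 'pending'",
            (call_id,),
        )
    return cur.rowcount == 1


def complete(conn: sqlite3.Connection, call_id: str) -> None:
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE call_id = ?", (call_id,))


def recover(conn: sqlite3.Connection) -> list[str]:
    """After a restart, return in-progress calls to pending so they are retried."""
    with conn:
        rows = conn.execute(
            "SELECT call_id FROM jobs WHERE status = 'in_progress' ORDER BY call_id"
        ).fetchall()
        conn.execute("UPDATE jobs SET status = 'pending' WHERE status = 'in_progress'")
    return [row[0] for row in rows]


conn = open_store()
for call_id in ("call-1", "call-2"):
    enqueue(conn, call_id)
claim(conn, "call-1")
complete(conn, "call-1")
claim(conn, "call-2")      # simulate a crash while call-2 is in progress
requeued = recover(conn)   # call-2 goes back to pending; call-1 stays done
```

The conditional update in claim is what prevents double processing: a call that is already in progress or done cannot be claimed again, and recover gives interrupted calls a defined path back into the queue.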

The result: the system now runs at production volume with a human review rate under 8%. More than 92% of reports go directly to the compliance team's queue without manual intervention. The model didn't change. The infrastructure around it did.

If you're building something like this

Most agentic systems don't fail because the model isn't capable enough. They fail because the infrastructure around the model wasn't designed to keep it coherent, evaluated, and recoverable over time.

The patterns aren't complicated. Context resets, external evaluation, sprint decomposition, explicit failure handling, observability. The challenge is designing for them before you need them, not after you've already hit the failure mode at 2am with a production system running a task that matters.

If you're building a long-running agentic system and want to walk through the harness design before you ship it, we're happy to have that conversation.
