Today's benchmark agents are legacy artifacts: a single LLM call, handed a flat wall of policy prose, asked to re-derive every rule from working memory on every turn. We propose the opposite. Treat each domain as a domain-specific harness over a capable agent runtime — Claude Code — packaged as a plugin of skills, scripts, sub-agents, MCP connectors, slash-commands, and deterministic hooks.
The thesis is simple and testable: policy placed at the right altitude beats policy crammed into one prompt. Soft judgment stays as natural-language skill; arithmetic becomes a script; hard invariants become hooks the agent cannot violate. We will measure this directly on τ²-bench pass-rate. And because the harness is observable, it sets up the longer game — agents that introspect their own failures and evolve their own harness.
01 · The turn
The software is not the code. It's the harness.
We have spent a while now building inside premier agent-harnesses — Claude Code, Codex, Cursor — and a conviction has hardened: the future of domain software is not legacy static code, and not free-form “vibe” code either. It is the harnessing of an LLM operating system for a domain, through natural-language procedures, rules, parallelization hints, and scripts that the agent composes on demand.
Anthropic's emerging standards make this concrete. Agent Skills, and their packaging into plugins — skills, slash-commands, sub-agents, scripts, MCP connectors, and deterministic hooks bundled together — let us harness the harness itself: a higher-order, domain-specific configuration of a general agent.
In Gradual Determinism we argued for governing a spectrum — from soft, judgment-laden natural language to hard, mechanically-enforced invariants — and for moving rules along it deliberately rather than all-or-nothing. This piece is the applied sequel. τ²-bench gives us a laboratory with a scoreboard: three customer-service domains, real policies, real tools, and an evaluator that grades the trajectory. We will rebuild its agents as harnesses, and let the numbers speak.
02 · The motivating anecdote
“The API does not check this for the agent.”
Open the airline domain's policy and a confession appears — not once, but three times, verbatim:
“The API does not check that cancellation rules are met, so the agent must make sure the rules apply before calling the API!”
Source: data/tau2/domains/airline/policy.md
The policy is a microcosm of the whole problem. Inside its 166 lines live four different kinds of rule, each demanding a different treatment:
- Soft judgment. “Be helpful… deny requests against this policy… do not give subjective recommendations.”
- Exact arithmetic. Compensation is “$100 times the number of passengers” for cancellations, “$50 times” for delays. Baggage allowance is a 3×3 table of membership × cabin.
- Eligibility predicates. Cancel only if booked <24h ago or airline-cancelled or business or insured-and-covered. Compensate only if silver/gold or insured or business.
- Hard invariants. “Basic economy flights cannot be modified.” “Cabin cannot be changed if any flight has already been flown.” The number of passengers cannot change — “even a human agent cannot.”
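The arithmetic and eligibility bullets above are mechanical enough to write down as code. A minimal Python sketch follows. Only the $100 and $50 rates and the "silver/gold or insured or business" condition come from the policy as quoted; the `Booking` class and its field names are hypothetical and are not taken from the τ²-bench schema.

```python
from dataclasses import dataclass

# Per-passenger rates as quoted from the policy; everything else is illustrative.
RATE_PER_PASSENGER = {"cancellation": 100, "delay": 50}


@dataclass(frozen=True)
class Booking:
    membership: str   # hypothetical field: "regular", "silver" or "gold"
    cabin: str        # hypothetical field: "basic_economy", "economy" or "business"
    insured: bool     # hypothetical field: travel insurance was purchased
    passengers: int


def can_compensate(b: Booking) -> bool:
    """Eligibility predicate: silver/gold member, or insured, or business cabin."""
    return b.membership in ("silver", "gold") or b.insured or b.cabin == "business"


def compensation_amount(b: Booking, event: str) -> int:
    """Dollars owed for a cancellation or a delay; 0 when the booking is ineligible."""
    if event not in RATE_PER_PASSENGER:
        raise ValueError(f"unknown event: {event!r}")
    if not can_compensate(b):
        return 0
    return RATE_PER_PASSENGER[event] * b.passengers
```

Written this way, the predicate and the formula are each a few lines that can be unit-tested, which is the property the prose version lacks.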
Now look at what the legacy agent actually does with all of it. The entire policy is interpolated into a single system prompt, once, and the model is told to be careful:
    # src/tau2/agent/llm_agent.py
    SYSTEM_PROMPT = """
    <instructions>{agent_instruction}</instructions>
    <policy>{domain_policy}</policy>
    """
    # every turn: system_prompt + history -> one generate() call.
    # the model must re-derive the $100-vs-$50 matrix, the 24h window,
    # and "basic economy cannot be modified" from memory, each time.
This is the determinism spectrum collapsed to a single point. An exact formula, a hard invariant, and a matter of taste are all rendered as the same medium — prose — and enforced by the same mechanism — the model's hope of remembering. The policy's own three-times-repeated plea is the tell: the author knew a guardrail was missing and could only ask the model, in English, to please supply it.
A rule that the system “must make sure of” but never checks is not a rule. It is a wish.
03 · The reframe
One policy, four altitudes, six primitives.
A harness lets each rule live where it belongs on the determinism spectrum. We stop asking one prompt to be judge, calculator, and bouncer all at once, and instead route each concern to the plugin primitive built for it.
- Soft judgment. Tone, helpfulness, scope refusal, confirmation etiquette. Stays natural language, disclosed only when relevant.
- Procedure. “Book a flight,” “cancel,” “modify cabin.” Named entry points that load just the steps for the task at hand.
- Computation. Baggage tables, refund deltas, the $100×N / $50×N formulas. Deterministic code, not token-by-token arithmetic.
- Invariant. “No modifying basic economy.” “No cancelling a flown segment.” Enforced before the tool call: refusable, not hopeful.
Concretely, here is how one domain's policy decomposes into one Claude Code plugin:
| Plugin primitive | Carries… | Airline example |
|---|---|---|
| skill | Progressive-disclosure procedures — loaded only when the conversation needs them. | book-flight, cancel-flight, compensation, each a focused SKILL.md instead of 166 flat lines. |
| command | Operator-facing entry points. | /cancel, /modify-cabin — route to the right skill with the right context. |
| script | Deterministic computation; removes arithmetic from the token stream. | baggage_allowance.py, refund_delta.py, comp_amount.py. (τ²'s own calculate tool exists because models are weak at this — a script retires the need.) |
| sub-agent | Isolated specialist reasoning with its own context window. | A policy-auditor that re-checks a proposed write against the rulebook before the user confirms. |
| hook | Deterministic guardrails on tool I/O. | PreToolUse blocks cancel_reservation unless the eligibility predicate passes — closing the exact gap the policy three times begged the model to close. |
| MCP | The domain tools themselves, as a connector. | Already built in this fork — tau2.mcp.unified_server exposes airline/retail/telecom tools over MCP. The plugin's hands exist; we are giving them a brain and a conscience. |
The hook is the heart of it. The legacy agent's failure mode is to confidently call cancel_reservation on an ineligible booking. A pre-tool hook turns the policy's plea into mechanism:
    # hooks/precheck_cancel.py: PreToolUse on cancel_reservation
    # Helper names (get_reservation, deny, allow, ...) are illustrative.
    def precheck(tool_call):
        r = get_reservation(tool_call.args["reservation_id"])
        if any_segment_flown(r):
            return deny("flown segment: transfer required")
        if not (booked_within_24h(r) or r.airline_cancelled
                or r.cabin == "business" or insured_and_covered(r)):
            return deny("cancellation rules not met")
        return allow()  # now the API call is safe by construction
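One possible self-contained version of that hook is sketched below. It assumes the Claude Code convention in which a PreToolUse command hook receives a JSON payload on stdin (including `tool_name` and `tool_input`) and blocks the call by exiting with code 2 and writing the reason to stderr. The reservation field names, the `lookup` callable, and the way the tool name is matched are hypothetical stand-ins, not the fork's actual schema.

```python
import json
import sys
from typing import Callable, Optional

ALLOW, BLOCK = 0, 2  # assumed convention: exit code 2 blocks the tool call


def cancel_denial_reason(r: dict) -> Optional[str]:
    """Return why cancellation is forbidden, or None if the policy allows it."""
    if r["any_segment_flown"]:
        return "flown segment: transfer required"
    eligible = (
        r["booked_within_24h"]
        or r["airline_cancelled"]
        or r["cabin"] == "business"
        or r["insured_and_covered"]
    )
    return None if eligible else "cancellation rules not met"


def precheck(payload: dict, lookup: Callable[[str], dict]) -> tuple[int, str]:
    """Decide on one PreToolUse payload; `lookup` fetches a reservation by id."""
    # Hypothetical matching rule: guard any tool whose name ends in cancel_reservation.
    if not payload.get("tool_name", "").endswith("cancel_reservation"):
        return ALLOW, ""
    reservation = lookup(payload["tool_input"]["reservation_id"])
    reason = cancel_denial_reason(reservation)
    return (BLOCK, reason) if reason else (ALLOW, "")


def main(lookup: Callable[[str], dict]) -> None:
    """Entry point for a command hook: read the payload, then exit with the verdict."""
    code, reason = precheck(json.load(sys.stdin), lookup)
    if reason:
        print(reason, file=sys.stderr)  # surfaced to the agent as the refusal reason
    sys.exit(code)
```

Keeping the decision in `precheck`, separate from the stdin and exit-code plumbing in `main`, means the policy logic can be tested without launching the agent. How `lookup` reaches the reservation store (for example, a read-only call into the MCP server) is left open here.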
04 · The wager
Why a harness should out-score a prompt.
τ²-bench does not grade vibes. Its evaluator scores each trajectory against the task's evaluation_criteria: did the correct tool actions fire, did the database end in the right state, was the policy honored? That makes our claim falsifiable, and we intend to report it as a pass-rate delta, head-to-head, same models, same tasks. Three mechanisms should move that number:
- Less to hold at once. Progressive disclosure means the model reasons over the cancellation procedure during a cancellation, not over the booking, baggage, and insurance rules simultaneously. Smaller live context, fewer cross-contaminated mistakes.
- Arithmetic leaves the token stream. A $50-vs-$100 confusion, a botched refund delta, an off-by-one baggage allowance: these are scored as failures today. A script, once tested, does not make them.
- Invariants become inviolable. The single most damaging failure, executing a write the policy forbids, is intercepted by a hook before it reaches the tool. The model no longer has to remember not to; it cannot.
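The first mechanism can be illustrated with a toy context builder. This is a sketch of the idea only, not Claude Code's actual skill-loading logic; the `Skill` class and `build_context` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # short summary, always visible to the model
    body: str         # full procedure, included only when the skill is active


def build_context(skills: list[Skill], active: set[str]) -> str:
    """Include every description, but the full body only for active skills."""
    parts = []
    for s in skills:
        parts.append(f"[{s.name}] {s.description}")
        if s.name in active:
            parts.append(s.body)
    return "\n".join(parts)
```

During a cancellation only the cancellation body is in view, while the other skills cost one line each.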
We are deliberately not promising a number in advance. The honest framing is that this is a wager with a scoreboard: if the harness does not beat the monolith on pass-rate, the thesis is wrong and we will say so. We believe it will — and that the margin grows on exactly the hard, conditional, arithmetic-laden tasks where flat prompting is weakest.
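The head-to-head comparison reduces to a paired pass-rate difference over the same task ids. A minimal sketch, assuming each run is summarised as a hypothetical mapping from task id to a boolean pass flag (this is not τ²-bench's actual result format):

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks that passed."""
    if not results:
        raise ValueError("no results")
    return sum(results.values()) / len(results)


def pass_rate_delta(baseline: dict[str, bool], harness: dict[str, bool]) -> float:
    """Harness pass rate minus baseline pass rate, over identical task sets."""
    if baseline.keys() != harness.keys():
        raise ValueError("runs must cover the same tasks")
    return pass_rate(harness) - pass_rate(baseline)


def flipped(baseline: dict[str, bool], harness: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Task ids the harness newly passes (wins) and newly fails (losses)."""
    wins = sorted(t for t in baseline if harness[t] and not baseline[t])
    losses = sorted(t for t in baseline if baseline[t] and not harness[t])
    return wins, losses
```

Reporting the wins and losses alongside the delta shows whether the margin really comes from the conditional, arithmetic-heavy tasks, and whether the harness regresses anywhere.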
05 · The longer game
Harnesses that watch themselves, and evolve.
A harness is not just more accurate — it is observable. Every skill that loads, every script that computes, every hook that allows or denies is a structured event. We weave that observability in from the start, then close the loop. τ² already supplies the other half: a verdict on every run.
This is where the gradual-determinism thesis pays its dividend. When a class of failures recurs, the fix is not to re-roll a giant prompt and pray — it is a localized, legible edit at a known altitude: a new precondition in one hook, a clarified step in one skill, a corrected constant in one script. The harness becomes a living document of hard-won domain knowledge, and the agent itself can propose the patch, justify it from the trace, and validate it against the scoreboard before the change lands.
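One crude way to spot a recurring failure class is to count, across failed runs only, which harness component acted last. The sketch below assumes a hypothetical event record with `run_id`, `primitive`, `name`, and `outcome` fields; the source does not specify a trace format, and "last component to act" is only a heuristic for where to look first.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    run_id: str
    primitive: str  # e.g. "skill", "script", "hook"
    name: str       # e.g. "precheck_cancel"
    outcome: str    # e.g. "loaded", "computed", "allow", "deny"


def recurring_failures(events, failed_runs, min_count=2):
    """Count the final event of each failed run; keep sites seen at least min_count times."""
    last = {}
    for e in events:  # events are assumed to arrive in time order
        if e.run_id in failed_runs:
            last[e.run_id] = e  # later events overwrite earlier ones
    counts = Counter((e.primitive, e.name) for e in last.values())
    return [(site, n) for site, n in counts.most_common() if n >= min_count]
```

Each returned site names one primitive and one file, which is the "known altitude" at which a patch would be proposed and then validated against the scoreboard.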
That is the destination: domain-specific harnesses that author and refine themselves, with determinism added exactly where evidence demands it — no sooner, no later. Self-improving software whose improvements are auditable, reversible, and grounded in measured outcomes rather than vibes.
06 · What we are building
The plan, in one breath.
For each τ²-bench domain — airline, retail, telecom — we ship one Claude Code plugin: skills that decompose the policy by altitude, scripts that own the arithmetic, sub-agents for isolated audit, the fork's existing MCP servers as the tool connector, slash-commands as entry points, and hooks that make the “must make sure” clauses true by construction. Then we run it against the legacy LLMAgent, same models, same tasks, and we report the pass-rate delta plainly. Then we instrument it, and we let it begin to improve itself.
We are not writing better agents. We are building the harness in which a capable agent becomes a domain expert — and then learns.