Governing the Determinism Spectrum in Agent Harnesses

The pitch, in one breath

Modern agent harnesses generate, execute, and verify code with remarkable autonomy — and almost no guarantees. An agent can paraphrase a compliance check, skip a validation, or route around a safeguard, because today those controls live in the same operational space the agent is free to reinterpret.

We propose treating determinism as a spectrum we govern rather than a property we wish for. The standardization of skills, commands, plugins, and hooks gives us a set of control surfaces — each one a place to convert probabilistic behavior into a guarantee. The value of a surface is set by how far outside the agent's discretion it sits. We call the practice gradual determinism: keep the agent dynamic everywhere it earns its keep, and pin down only what must not vary.

01 / THE PREMISE

The controls are in the wrong room

The power of a harness like an autonomous coding agent is precisely its freedom: it interprets intent, writes the code, runs it, and decides when it is done. That freedom is not a bug to be removed. But a control that the agent can rewrite is not a control — it is a suggestion.

A skill that says “validate the license headers” depends on the agent choosing to validate, choosing to do so correctly, and not quietly deciding the step was unnecessary this time. When the rule and the actor share a room, the actor always wins. The fix is not to constrain the agent's intelligence. It is to move the rules into a room the agent cannot enter — and to do so only where a guarantee is actually worth its cost.

02 / THE FRAME

Five surfaces, one axis

The mechanisms we already have line up along the anatomy of the agent loop. Commands govern its entry. Scripts govern execution. Hooks govern the exit. Loops govern continuation. Wrappers own the whole loop. Read left to right, that ordering also tracks two things at once: how much is guaranteed, and how far the control sits outside the agent's reach.

← agent discretion · in-band operator guarantee · out-of-band →

It is, strictly, two dimensions: the breadth of the guarantee and its tamper-resistance. A skill is fully in-band — the agent can paraphrase or skip it. A hook is enforced by the harness runtime whether or not the agent cooperates. That gap is what separates a pattern from a control.

03 / THE SURFACES

Where determinism is bought

Commands

Routing determinism

governs entry

For three sprints, the “run the compliance gate” skill simply never fired. The agent kept deciding, in good faith, that it had already covered the spirit of the rule. Nobody could point to the run that proved it. We exposed the gate as /comply — a command the reviewer types — and the question “did the gate run?” became answerable for the first time.

Skill selection belongs to the agent; the user cannot compel a skill to fire. Commands invert that. A command is operator-invoked, so it is a routing guarantee: a named, auditable entry point into a known procedure. You cannot force the agent to choose the right path, but you can hand the operator a path the agent cannot decline to take.

GuaranteeThis procedure ran, because a human invoked it — not because the model judged it relevant.

Scripts

Computational determinism

governs execution

The settlement-rounding step came out a cent different across runs — never wrong enough to fail a test, never right enough to trust. The skill had described the calculation in prose and left the agent to re-derive it each time. We froze the reviewed implementation into a script and reduced the skill phase to “invoke it.” The cent stopped moving.

A skill phase written in natural language leaves interpretation — and code generation — to the agent at runtime. That is the right default for novel work. But once a phase has been generated, reviewed, curated, and approved, the same code can be packaged as a script and the phase description reduced to an instruction to adapt or invoke it. The non-determinism of regeneration collapses into the determinism of a frozen, version-controlled artifact.

GuaranteeThis step computes the same way every time, from code a human approved — not freshly improvised prose.

Hooks

Verification determinism

governs exit

“Done” meant whatever the agent felt was done — sometimes with a failing test suite and a missing license header in the diff. We put a gate at the end of the turn. Now “done” has to survive a checker that does not negotiate.

At a hook point — the stop-hook fires when the agent tries to end its turn — deterministic code inspects the output for structural and semantic properties. And the loop can be re-driven from here: a stop-hook that returns a block decision with a reason prevents the agent from stopping and feeds that reason back as its next instruction. The hook becomes a hard exit gate. One guardrail is mandatory — the harness passes a stop_hook_active flag, true when the agent is already in a forced continuation, and the hook must honor it or it will loop forever.

#!/usr/bin/env python3
"""Stop-hook gate: refuse to end the turn until the build is verifiably clean.

The harness invokes this when the agent tries to stop. Emitting a "block"
decision returns control to the agent with `reason` as its next instruction;
exiting 0 silently lets it stop.
"""
import json, subprocess, sys

payload = json.load(sys.stdin)

# Honor the forced-continuation flag, or this gate loops forever.
if payload.get("stop_hook_active"):
    sys.exit(0)

checks = subprocess.run(["make", "verify"], capture_output=True, text=True)
if checks.returncode != 0:
    print(json.dumps({
        "decision": "block",
        "reason": f"Verification failed; fix before finishing:\n{checks.stdout}",
    }))
sys.exit(0)

Where the property needs judgment rather than an exit code — “is this actually complete?” — a prompt- or agent-type hook runs a cheap evaluator in the same slot. The shape is identical; only the oracle changes.

GuaranteeThe turn cannot end until the output passes a check the agent does not control — and failure re-drives the work.

Loops

Conditional continuation

governs iteration

The migration guide needed four passes before the style linter went quiet. Doing those passes by hand — re-prompting, re-reading the report — was the least deterministic part of an otherwise tidy pipeline. A /loop turned “keep going until it's clean” from a habit into a construct.

A /loop-style construct runs a task repeatedly until a satisfactory outcome is reached. This is not deterministic on its own — but it raises the floor of what we can expect. The honest caveat: a loop is only as deterministic as its verifier. A loop without an oracle just burns turns. Which is why loops and hooks compose naturally — the hook is the oracle, the loop is the driver — and the pair converges toward a property you can actually name.

GuaranteeThe harness keeps working until a stated condition holds — as trustworthy as the condition you can verify.

Wrappers

Orchestration determinism

governs the whole loop

The overnight batch was fine right up until the morning it wasn't — an agent applied a plan no one had seen. We did not want it slower; we wanted a hard gate between plan and apply, owned by something that was not the agent. So we wrote the gate ourselves and called the harness from inside it.

A custom wrapper drives the harness through its API or CLI — a more bespoke, more deterministic relative of /loop. And the multi-turn question resolves cleanly: a first headless call returns a session id, and subsequent calls resume that same session, preserving full context across invocations — exactly the thread-id pattern familiar from graph-based orchestration SDKs. The control flow now lives in code the operator owns. The agent supplies capability; the wrapper supplies the guarantee that nothing is applied unreviewed.

"""Deterministic envelope around an agent harness.

The orchestrator — not the agent — owns the control flow: it opens a
session, holds it open across turns by session id, and inserts a hard
human-approval gate between *plan* and *apply*.
"""
import json, subprocess

def run(prompt, session=None):
    """Invoke the harness for one turn; return its parsed JSON result.

    :param prompt: the operator message for this turn.
    :param session: an existing session id to resume, or None to start one.
    :returns: the result, including `session_id` for continuation.
    """
    cmd = ["claude", "-p", prompt, "--output-format", "json", "--max-turns", "8"]
    if session:
        cmd += ["--resume", session]
    return json.loads(subprocess.run(cmd, capture_output=True, text=True).stdout)

plan = run("Plan the migration. Do not modify any files yet.")
sid = plan["session_id"]

if input("Approve this plan? [y/N] ").strip().lower() == "y":
    run("Apply the approved plan exactly as described.", session=sid)

GuaranteeThe entire control flow — gates, approvals, retries — lives in code the operator owns; the agent is a callable primitive inside it.

04 / COMPOSITION

Deterministic skeleton, stochastic muscle

These surfaces are not alternatives. They stack. A mature harness uses a command to route into a known procedure, a script to perform the step that must compute identically, a hook to verify the result and re-drive on failure, and a wrapper to own the approvals around the whole exchange. The agent keeps its dynamism in the gaps between — which is exactly where dynamism is worth having.

We are not making the agent deterministic. We are deciding, surface by surface, which invariants the agent is no longer permitted to violate.

It is worth being precise about what this buys. Generation remains stochastic; the model still improvises. What we gain is guaranteed invariants at chosen checkpoints — islands of determinism around a probabilistic core. That is a more honest claim than “deterministic agents,” and a more useful one.

05 / WHY IT MATTERS

The control surface is the audit surface

Compliance, security, and governance do not ask for cleverness. They ask: can you show, after the fact, that the rule held? Every surface on this spectrum is code — a command definition, a frozen script, a hook, an orchestration wrapper. Code is versionable, reviewable, signable, and attestable. The same move that makes a control tamper-resistant against the agent also makes it legible to an auditor.

That is the quiet payoff of pushing controls out-of-band. We stop hoping the agent behaved and start being able to demonstrate the envelope it ran inside. For any setting where an autonomous system touches regulated work, that shift — from trust to evidence — is the whole game. Gradual determinism is how we get there without giving up the autonomy that made the harness worth adopting in the first place.