A Position Piece · A Governed AI Harness for NSW Legal Practice
A lawyer carries the liability for everything the AI produces. Our job is to make deterministic the parts of that work that must never be left to chance.
The pitch, in one breath
A solicitor remains professionally and ethically accountable for AI output they have no accessible way to verify. Confidence scores and source traces are built for engineers, not for how a lawyer reasons about reliability — so verification collapses into blind trust or a full manual re-do that erases the time saved. The person carrying the risk is the one least equipped to discharge it.
The answer is not a better confidence score. It is deciding, deliberately, which parts of legal work must not be stochastic (the conflict check, the privilege call, the limitation date, the cited authority, the costs disclosure) and guaranteeing them in code the model cannot route around. We call this gradual determinism: let the AI stay fluent and fast everywhere it earns its keep, and pin down the handful of things that, if wrong, end careers. The result is a domain harness for NSW practice in which the safeguards are guarantees, not suggestions.
Terms used above:
Conflict of interest: A solicitor generally cannot act where duties to one client clash with duties to another, current or former. Firms run a conflicts search against their client records before taking a matter on. (Solicitors' Conduct Rules, rr 10–11)
Client legal privilege: A client's right to keep confidential lawyer–client advice (s 118) and material prepared for litigation (s 119) out of evidence. Disclosing it inadvertently can waive it for good. (Evidence Act 1995 (NSW) s 118)
Limitation period: The statutory deadline to start a claim (generally six years for contract or tort). Miss it and the claim is barred, a classic source of negligence claims against firms. (Limitation Act 1969 (NSW))
Costs disclosure (s 174): NSW solicitors must tell a client in writing, as soon as practicable, how fees are calculated and give an estimate of total legal costs. (Legal Profession Uniform Law (NSW) s 174)
Gradual determinism (our framing): Borrowed from “gradual typing” in programming: instead of making the whole AI deterministic, you opt into hard guarantees only at the points that must not vary, and leave it fluent everywhere else.
01 / THE PREMISE
An AI that usually runs the conflict check, usually gets privilege right, and usually cites a real case is not a safeguard — it is a liability with good odds. When the rule and the actor share a room, the actor wins; a model can paraphrase a compliance step, decide a validation was unnecessary this time, or fabricate an authority that reads perfectly. The fix is not to constrain the lawyer's judgment or the model's fluency. It is to move the rules that must hold into a layer the model cannot enter.
This also answers the most honest objection of all: that legal tech tends to add a step rather than remove one, leaving the lawyer to check, reconcile and correct, and so shifting work from doing to reviewing. A deterministic check is the opposite of another review for the lawyer to perform. The harness runs the verification, and refuses to finish until it passes. The re-do is removed, not relocated onto the person who was already carrying the liability.
02 / THE RISK MAP
The fears that boutique firms voice about legal AI sort cleanly onto the control surfaces below. Most are direct: a guarantee can be placed exactly where the fear lives. Two — the billing questions — are commercial decisions the harness supports with evidence rather than solves, and are marked accordingly.
03 / THE FRAME
The mechanisms already standardised in modern agent harnesses line up along the anatomy of the matter. Commands govern its entry. Scripts govern execution. Hooks govern the exit. Loops govern continuation. Wrappers own the whole workflow. Read left to right, the ordering tracks two things at once: how much is guaranteed, and how far the control sits beyond the model's reach. A skill the model can paraphrase is in-band; a hook the harness enforces regardless of the model is out-of-band — and that gap is the whole difference between a pattern and a control.
04 / THE SURFACES, IN NSW PRACTICE
Skill selection belongs to the model; the lawyer cannot compel a skill to fire. Commands invert that. A command is operator-invoked: a named, logged, auditable entry point into a procedure that runs the same way for every fee-earner in the firm (a fee-earner being anyone whose time is billable, such as solicitors and paralegals, as distinct from support staff). You cannot make the model choose the right path; you can hand the lawyer a path the model cannot decline.
Commands govern whether the conflict, costs and AML steps happened at all.
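A minimal sketch of what an operator-invoked command could look like, assuming a hypothetical in-process registry. The names `command`, `invoke`, `COMMANDS`, `AUDIT_LOG` and the `open-matter` procedure are illustrative assumptions, not part of any particular harness, and the step names are placeholders for real checks. The point is only that the entry is named, returns the same fixed sequence for every operator, and is logged regardless of what the model would have chosen.

```python
"""Sketch: an operator-invoked command with a fixed procedure and an audit trail.
All names here are hypothetical; a real harness has its own command format."""
from datetime import datetime, timezone

COMMANDS = {}    # command name -> procedure
AUDIT_LOG = []   # append-only record of every invocation


def command(name):
    """Register a function as a named, operator-invoked entry point."""
    def register(func):
        COMMANDS[name] = func
        return func
    return register


@command("open-matter")
def open_matter(client, matter):
    """Return the fixed sequence of intake steps (placeholders for real checks)."""
    # The order is fixed in code; the model is never asked to choose it.
    return ["conflict_check", "costs_disclosure", "aml_identity_check"]


def invoke(name, operator, **args):
    """Run a command by name and log who invoked it, when, and what it returned."""
    if name not in COMMANDS:
        raise KeyError(f"unknown command: {name}")
    steps = COMMANDS[name](**args)
    AUDIT_LOG.append({
        "command": name,
        "operator": operator,
        "args": args,
        "steps": steps,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return steps


steps = invoke("open-matter", operator="jsmith",
               client="Acme Pty Ltd", matter="Lease dispute")
print(steps)
```

Because the log entry is written by `invoke` rather than by the model, the record of entry exists even when the conversation that follows goes badly.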
A skill phase written in natural language leaves both interpretation and code generation to the model at runtime — the right default for novel work, the wrong one for a calculation that must come out the same every time. Once a phase has been generated, reviewed and approved, the code is frozen as a versioned script and the phase reduced to “invoke it.” The non-determinism of re-derivation collapses into a reviewed, auditable artifact.
Scripts govern deadlines, trust arithmetic, and the contents of the evidentiary record.
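As an illustration of a phase that has been frozen, here is a small sketch of a versioned date calculation. Everything in it is an assumption made for the example: the function names, the version string, and the six-year default (which mirrors the general contract or tort period mentioned earlier in this piece). Which period applies to a given claim, and how its last day is counted, are legal questions settled when the script is reviewed, not by this code.

```python
"""Sketch of a frozen, versioned script. Hypothetical names throughout."""
from datetime import date

SCRIPT_VERSION = "limitation-date/1.0.0"  # changed only through review


def add_years(start, years):
    """Return the same calendar day `years` later; 29 Feb falls back to 28 Feb."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:  # 29 February with a non-leap target year
        return start.replace(year=start.year + years, day=28)


def limitation_date(accrual, years=6):
    """Compute the anniversary date and stamp it with the script version."""
    return {
        "accrual": accrual.isoformat(),
        "anniversary": add_years(accrual, years).isoformat(),
        "script_version": SCRIPT_VERSION,
    }


result = limitation_date(date(2019, 7, 15))
print(result)
```

The same input produces the same output on every run, and the version stamp lets a reviewer tie any recorded date back to the exact script that produced it.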
Hooks fire deterministic code at fixed points in the loop. Their headline job here is at the exit, inspecting the turn's output before it is allowed to be “done”, but the same mechanism also gates each tool call: a PreToolUse hook can allow, deny or rewrite a call before it runs, and a PostToolUse hook inspects the result once it returns (though it cannot undo a call that has already executed). Any of them can gate the work, and a stop-hook can re-drive it: returning a block decision with a reason prevents the model from stopping and feeds that reason back as its next instruction, so a failed check goes back to the model rather than landing on the lawyer's desk. The checks are not another review the lawyer must perform; they are gates the work has to pass. One guardrail is mandatory at the exit: the harness exposes a flag that is true when the model is already in a forced continuation, and the stop-hook must honour it or it will loop forever.
Where the property needs judgment rather than a pass/fail rule — “does this advice actually address the question asked?” — a prompt- or agent-type hook runs a cheap evaluator in the same slot. The shape is identical; only the oracle changes.
#!/usr/bin/env python3
"""Stop-hook gate: a draft cannot leave the harness until every cited
authority is verified and no privileged material is marked producible.

The harness invokes this when the model tries to end its turn. Emitting a
"block" decision returns control to the model with `reason` as its next
instruction; exiting 0 silently lets the turn finish.
"""
import json
import sys

from firm.verify import unverified_citations, privilege_breaches

payload = json.load(sys.stdin)

# Honour the forced-continuation flag, or this gate loops forever.
if payload.get("stop_hook_active"):
    sys.exit(0)

draft = payload["last_output"]
problems = unverified_citations(draft) + privilege_breaches(draft)

if problems:
    print(json.dumps({
        "decision": "block",
        "reason": "Resolve before finishing:\n- " + "\n- ".join(problems),
    }))

sys.exit(0)
Hooks govern cited authorities, privilege, confidentiality, and disclosure.
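The tool-call side of the same mechanism can be sketched as a pure decision function. This is a hedged illustration only: the tool names, the step names and the `(decision, reason)` return shape are assumptions made for the example, and a real harness defines its own hook input and output format. What it shows is the ordering invariant, a gated tool refused until the steps it depends on are already on the record.

```python
"""Sketch of a pre-tool-call gate. Tool and step names are hypothetical."""

# Steps that must already be on the matter record before each gated tool runs.
PREREQUISITES = {
    "open_matter": ["conflict_check", "verify_identity"],
    "send_costs_agreement": ["conflict_check"],
}


def pre_tool_gate(tool_name, completed_steps):
    """Return ("allow", "") or ("deny", reason) for a proposed tool call."""
    required = PREREQUISITES.get(tool_name, [])
    missing = [step for step in required if step not in completed_steps]
    if missing:
        return "deny", f"{tool_name} blocked; run first: {', '.join(missing)}"
    return "allow", ""


print(pre_tool_gate("open_matter", {"conflict_check"}))
print(pre_tool_gate("open_matter", {"conflict_check", "verify_identity"}))
```

Tools with no entry in `PREREQUISITES` pass straight through, which keeps the gate narrow: it constrains only the calls that carry an ordering obligation.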
A loop construct runs a task repeatedly until a satisfactory outcome is reached. It is not deterministic on its own — but it raises the floor of what the firm can expect, and it composes naturally with hooks: the hook is the oracle, the loop is the driver. The honest caveat is that a loop is only as trustworthy as its verifier; a loop without a real test simply burns time.
Loops govern when a draft counts as “finished”: only once a named test is satisfied.
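The pairing of oracle and driver can be sketched in a few lines. This is an illustration under stated assumptions: `drive`, `generate` and `verify` are hypothetical names, and the two stand-in functions at the bottom merely simulate a model and a verifier. The sketch bounds the iteration and raises when the bound is hit, so a draft that never passes is not returned as finished.

```python
"""Sketch of a bounded loop: the verifier is the oracle, the loop is the driver.
`generate` and `verify` are hypothetical callables supplied by the firm."""


def drive(generate, verify, max_iterations=5):
    """Call generate(feedback) until verify(draft) reports no problems.

    Returns (draft, iterations_used). Raises RuntimeError when the bound is
    hit, so an unverified draft is never returned as finished.
    """
    feedback = []
    for iteration in range(1, max_iterations + 1):
        draft = generate(feedback)
        feedback = verify(draft)      # list of problems; empty means pass
        if not feedback:
            return draft, iteration
    raise RuntimeError(f"not verified after {max_iterations} iterations: {feedback}")


# Stand-ins for the model and the verifier, for illustration only.
def fake_generate(feedback):
    return "Advice citing [verified]" if feedback else "Advice citing [unverified]"


def fake_verify(draft):
    return ["citation not verified"] if "[unverified]" in draft else []


draft, used = drive(fake_generate, fake_verify)
print(draft, used)
```

If `verify` always returned an empty list, the loop would accept the first draft every time, which is the caveat about the verifier stated above in concrete form.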
A custom wrapper drives the harness through its API or CLI. It is the most deterministic surface, because the control flow lives in code the firm owns rather than in the model's discretion. A first headless call returns a session identifier; subsequent calls resume that session, preserving full context across turns, so an external program can hold a matter open, insert hard human gates between phases, and bound iteration. The model supplies capability; the wrapper supplies the guarantee that nothing leaves the firm unreviewed.
"""Deterministic envelope around the harness for a NSW matter.

The firm, not the model, owns the control flow: it opens a session, holds it
open by session id across turns, and makes the "serve" step structurally
unreachable without a supervising principal's sign-off.
"""
import json
import subprocess


def run(prompt, session=None):
    """Run one turn of the harness; return its parsed JSON result.

    :param prompt: the operator instruction for this turn.
    :param session: a session id to resume, or None to open a new matter thread.
    :returns: the result dict, including `session_id` for continuation.
    """
    cmd = ["claude", "-p", prompt, "--output-format", "json", "--max-turns", "8"]
    if session:
        cmd += ["--resume", session]
    # check=True surfaces a failed CLI call instead of parsing empty output.
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


def principal_signs_off(advice):
    """Illustrative human gate: show the draft and require an explicit "y".

    A firm would replace this placeholder with its own approval workflow.
    """
    print(advice)
    return input("Approve for release? [y/N] ").strip().lower() == "y"


draft = run("Draft the advice. Do not serve or file anything.")
sid = draft["session_id"]

# The advice cannot be served without a principal approving it first.
if principal_signs_off(draft["result"]):
    run("Finalise and send the approved advice to the client.", session=sid)
Wrappers govern what reaches a client or a court, and who signed off on it.
05 / THE BILLING QUESTION
We will be straight about this, because it matters to the relationship. A determinism harness does not price a matter. Once AI compresses the hours that the billable hour quietly used to carry the firm's risk, pricing that risk deliberately and up front is a commercial decision that belongs to the partners, not to a piece of software. (The billable hour is the traditional fee model: charging for time spent, recorded in six-minute units, each one tenth of an hour.)
What the harness does is supply the missing substrate for that decision. Because every surface is code, the firm gets a guaranteed, auditable record of which controls ran, what was verified, and what risk was discharged on a given matter. That converts “hours spent” into “value delivered and risk retired” — a defensible basis for value- or risk-based pricing, and a reproducible footing for the s 174 estimate itself. We give you the evidence on which a new billing methodology can stand; the methodology remains yours.
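One way such a record could be made auditable is to chain each entry to the hash of the one before it. The sketch below is an assumption-laden illustration, not a prescribed format: the field names, the control names and the chaining scheme are all hypothetical. It shows that an edit to any earlier entry becomes detectable by recomputation.

```python
"""Sketch of a tamper-evident control record for one matter. Field names and
the chaining scheme are illustrative assumptions, not a prescribed format."""
import hashlib
import json


def _digest(entry):
    """Hash an entry's canonical JSON form."""
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()


def append_control(record, control, version, outcome):
    """Append a control result, chained to the hash of the previous entry."""
    entry = {
        "control": control,
        "version": version,
        "outcome": outcome,
        "prev": record[-1]["hash"] if record else None,
    }
    entry["hash"] = _digest(entry)   # computed before the "hash" key exists
    record.append(entry)
    return record


def verify_chain(record):
    """Return True if every entry matches its hash and links to its predecessor."""
    prev = None
    for entry in record:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


record = []
append_control(record, "conflict_check", "1.2.0", "pass")
append_control(record, "citation_verify", "0.9.1", "pass")
print(verify_chain(record))      # chain intact
record[0]["outcome"] = "fail"    # simulate a later edit to the first entry
print(verify_chain(record))      # the edit is detected
```

A chain like this does not by itself detect entries dropped from the end; in practice the latest hash would need to be anchored somewhere outside the record, such as in the matter file.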
06 / COMPOSITION
These surfaces are not alternatives; they stack. A command routes into the firm's procedure, a script performs the calculation that must come out identically, a hook verifies the result and re-drives on failure, and a wrapper holds the partner's sign-off around the whole exchange. The model keeps its fluency in the gaps between — drafting, summarising, first-pass research — which is exactly where fluency is worth having and where a wrong answer is cheap to catch.
We are not making the AI deterministic. We are deciding, surface by surface, which things in a matter the AI is no longer permitted to get wrong.
It is worth being precise about the claim. Generation stays stochastic; the model still improvises a draft. What the firm gains is guaranteed invariants at the points that carry liability — islands of determinism around a fluent core. That is a more honest promise than “trustworthy AI,” and a far more useful one to a profession that signs its name to the output.
07 / THE EVIDENCE
An argument about determinism should be settled deterministically. So we built the position into something measurable, using the τ²-bench evaluation format. τ²-bench is an open framework for evaluating tool-using conversational agents: a simulated user talks to the agent inside a domain environment, and an evaluator scores the resulting record (database end-state and required disclosures) against the task's criteria. In that format we encoded a NSW client-intake domain: a small database of clients, matters and practitioners; tools for the conflict check, identity verification, costs agreements and matter opening; and twelve tasks of deliberately mixed difficulty. Some are a clean run through the procedure. Others turn entirely on a single invariant that must not vary: a capped fee, an expired practising certificate, the order of two steps, a duplicate client record, a costs-disclosure threshold.
Against that benchmark we ran two agents on the same base model (claude-sonnet-4.5). The baseline is a conventional agent — one policy prompt plus the domain tools, free to interpret every step afresh at runtime: the “rule and actor in the same room” of section 01. The harness is the higher-order plugin this paper describes — the identical model, now with skills routing into the firm's procedures, frozen scripts for the calculations, and PreToolUse hooks enforcing provenance and the guardrails before each tool call. Same model, same tasks, same single trial. The only variable is the governance around the loop.
The split is not random. The two agents agree on the salient, hard-to-miss obligations: the happy path, both conflict scenarios, even the conflict check held under pressure to skip it. The baseline's five failures fall exactly where a quiet invariant has to hold and is easy to glide past: refusing a fee type the Legal Profession Uniform Law (LPUL) prohibits, declining a practitioner whose certificate has lapsed, verifying identity before opening, reusing a client record instead of duplicating it, and recording an accurate costs disclosure against a client's AI-formed expectations. In every one of those five, the baseline narrated a successful intake (fluent, confident, complete) while leaving the matter record non-compliant in precisely the way the lawyer would have carried the liability for. The harness, holding those same invariants out-of-band, passed all twelve.
Seven versus twelve, from one model. The five it lost are the five a firm could least afford to lose.
Both runs ship with the repository — full trajectories, every tool call and verdict. To read them side by side, start the leaderboard web UI, open the Visualizer, choose Trajectories, and in the model selector pick claude-sonnet-4-5 (legal baseline) or claude-sonnet-4-5 (legal-harness); selecting any task opens its turn-by-turn record.
git clone https://github.com/pdhoolia/tau2-bench
cd tau2-bench/web/leaderboard
npm install    # first run only
npm run dev    # serves http://localhost:5173
08 / WHY IT MATTERS HERE
Compliance, supervision and a regulator do not ask for cleverness. They ask whether the firm can show, after the fact, that the rule held. Every surface on this spectrum is code (a command definition, a frozen script, a hook, an orchestration wrapper), and code is versionable, reviewable, signable and attestable. The same move that makes a control tamper-resistant against the model makes it legible to a principal, to an external examiner, and to the Legal Services Commissioner. (The Office of the NSW Legal Services Commissioner, OLSC, is the independent NSW regulator that receives and oversees complaints about solicitors and barristers and can take disciplinary action; it co-regulates the profession with the Law Society.)
For a boutique firm carrying a growing, interconnected set of obligations — conduct rules, costs disclosure, trust accounting, and now an AML/CTF regime arriving with a fixed commencement date — that shift from hoping the AI behaved to demonstrating the envelope it ran inside is the whole point. It is how the firm sees its own compliance posture, how a partner supervises work they cannot personally read, and how the person who carries the liability is finally equipped to discharge it. Gradual determinism is how a practice adopts AI without surrendering the accountability that makes it a practice.