OpenClaw Ships Evidence-Based Harness to Benchmark GPT-5.4 Against Claude Opus 4.6

For months, “GPT-5.4 feels less agentic than Opus” has been the kind of thing power users would say in forums and project chats — vivid, probably true, but impossible to prove. OpenClaw just turned that complaint into a release gate.

April 11, 2026 · 7 min read

First-Wave Parity Pack

Five shared scenarios both models have to survive before a parity verdict is allowed

1. approval-turn-tool-followthrough: After a short "ok do it" approval, the model has to take the first concrete tool action — not restate its plan.

2. model-switch-tool-continuity: Tool-using work has to stay coherent across model and runtime switches instead of resetting into commentary.

3. source-docs-discovery-report: Read source and docs, synthesize findings, and keep acting agentically — no thin summaries, no early stops.

4. image-understanding-attachment: Mixed-mode attachment work has to stay actionable instead of collapsing into vague narration.

5. compaction-retry-mutating-tool: A real mutating write has to keep replay-unsafety explicit even when the run compacts, retries, or loses replay state.

OpenClaw has been a comfortable home for frontier tool-using models for a while now, but the experience gap between GPT-5.4 and Claude Opus 4.6 kept showing up in the same awkward places. The new OpenAI flagship was perfectly capable of writing a plan, reading a file, or recognizing an image — and then stalling before actually doing the work. Users noticed. Maintainers noticed. Nobody could prove it.

The latest OpenClaw update ships the first part of the answer: a structured parity program split into four reviewable slices that together close the gap in runtime behavior and — this is the headline — produce a machine-readable verdict on whether GPT-5.4 and Opus 4.6 are actually at parity on the same shared scenarios. The days of debating model behavior with screenshots and vibes are officially over for anyone running OpenClaw.

Four Problems, Four Slices

The parity effort calls out four specific ways GPT-5.4 and Codex-style models were underperforming on OpenClaw. They could stop after planning instead of actually executing. They could misuse strict OpenAI and Codex tool schemas in subtle ways. They could ask for elevated permissions even when full access was impossible in the current runtime. And they could lose the state of long-running tasks when a session replayed or compacted under pressure.

Each of those problems gets its own dedicated slice of work. The first introduces a strict-agentic execution contract for embedded runs, which refuses to treat "I will now do the thing" as completion. The second slice teaches OpenClaw to be brutally honest about why a provider call failed — missing scope, refreshed auth that still bounced, HTTP 403 responses that come back as HTML, proxy drama, DNS hiccups, timeouts, and blocked full-access modes all get explicit surfaces. The third slice hardens OpenAI and Codex tool-schema compatibility and finally surfaces replay and liveness state so paused, blocked, and abandoned tasks stop disappearing into generic failure messages.

The fourth slice is what makes the rest of it checkable: a first-wave QA-lab parity pack that runs both models through exactly the same five scenarios and compares the outcomes as shared evidence.

What Ships When You Run the Harness

The flow is deliberately boring, which is the point. Operators run the QA suite under GPT-5.4 and then again under Opus 4.6. Each run drops a qa-suite-summary.json into an artifacts directory. A single subcommand — pnpm openclaw qa parity-report — then takes the candidate summary, the baseline summary, and a repo root, and writes three artifacts back out: a human-readable Markdown report, a machine-readable JSON summary, and an explicit pass or fail verdict.
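The release notes do not publish the schema of qa-suite-summary.json. As a rough sketch only — every field name below is an assumption for illustration, not the actual OpenClaw format — the candidate and baseline artifacts might carry something like:

```typescript
// Hypothetical shape for qa-suite-summary.json. All field names are
// invented here; the real OpenClaw schema may differ.
interface QaSuiteSummary {
  model: string;               // e.g. "gpt-5.4" or "claude-opus-4.6"
  scenarios: string[];         // scenario ids exercised in this run
  completionRate: number;      // fraction of tasks finished end to end
  unintendedStopRate: number;  // fraction of runs that stopped early
  validToolCallRate: number;   // fraction of schema-valid tool calls
  fakeSuccessCount: number;    // "claimed victory without doing the work"
}

// Parse one artifact and sanity-check that the rate fields are in [0, 1].
function parseSummary(json: string): QaSuiteSummary {
  const s = JSON.parse(json) as QaSuiteSummary;
  const rates = [s.completionRate, s.unintendedStopRate, s.validToolCallRate];
  for (const rate of rates) {
    if (typeof rate !== "number" || rate < 0 || rate > 1) {
      throw new Error(`rate out of range in summary for ${s.model}`);
    }
  }
  return s;
}
```

Whatever the real field names are, the important property is that both runs emit the same shape, so the report step is a pure comparison of two files.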

That verdict is the thing CI systems and release captains can actually build on. It compares the aggregate metrics that matter for agentic work: completion rate, unintended-stop rate, valid-tool-call rate, and a fake-success count that flags suspicious "technically passed" outcomes where the model claimed victory without doing the work. If the candidate model ran fewer scenarios than the baseline, or regressed on any of those aggregates, the gate refuses to call it parity. There is no mechanism for declaring parity against a thinner scenario set.

“Pass means GPT-5.4 covered the same scenarios as Opus 4.6 and did not regress on the agreed aggregate metrics. Fail means at least one hard gate tripped: weaker completion, worse unintended stops, weaker valid tool use, any fake-success case, or mismatched scenario coverage.”
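Read as pseudocode, that quoted gate reduces to a handful of comparisons. A minimal sketch, with every type and function name invented here but the rules mirroring the stated hard gates one for one:

```typescript
// Invented names; each check corresponds to one of the article's hard gates.
interface ParityAggregates {
  scenarios: string[];
  completionRate: number;      // higher is better
  unintendedStopRate: number;  // lower is better
  validToolCallRate: number;   // higher is better
  fakeSuccessCount: number;    // any nonzero value trips the gate
}

function parityVerdict(
  candidate: ParityAggregates,
  baseline: ParityAggregates,
): "pass" | "fail" {
  // Mismatched scenario coverage: no parity against a thinner set.
  const covered = new Set(candidate.scenarios);
  if (!baseline.scenarios.every((id) => covered.has(id))) return "fail";
  // Weaker completion.
  if (candidate.completionRate < baseline.completionRate) return "fail";
  // Worse unintended stops.
  if (candidate.unintendedStopRate > baseline.unintendedStopRate) return "fail";
  // Weaker valid tool use.
  if (candidate.validToolCallRate < baseline.validToolCallRate) return "fail";
  // Any fake-success case.
  if (candidate.fakeSuccessCount > 0) return "fail";
  return "pass";
}
```

Note that every gate is one-directional: the candidate can beat the baseline on any metric without consequence, but cannot regress on any of them.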

The Scenarios Are Deliberately Unglamorous

The five scenarios in the first wave are not benchmarks in the leaderboard sense. They are the specific failure modes that made GPT-5.4 feel unreliable in real OpenClaw sessions, rebuilt as deterministic tests. Each one has a defined “good behavior” signal and a defined failure signal, and the harness refuses to produce a verdict unless all five are exercised.

approval-turn-tool-followthrough is the short “ok do it” test. The good result is that the model starts the first concrete tool action immediately. The failure signal is a plan-only follow-up with no tool activity, or a blocked turn that cites no real blocker. model-switch-tool-continuity checks what happens when the runtime swaps models mid-task — a healthy response preserves task context and keeps working; a failure is a reset into commentary or a stop immediately after the switch.

source-docs-discovery-report tests whether the model can read source and documentation, synthesize findings, and keep acting instead of producing a thin summary and stopping. image-understanding-attachment does the same thing for attachment-driven tasks — the model has to interpret the image, connect it to tools, and continue the task, not narrate vaguely and stall.

The most interesting one is compaction-retry-mutating-tool. It performs a real mutating write and then puts the session under compaction pressure to see whether replay-unsafety stays explicit through the ordeal. The failure signal is a mutating write that happens but whose replay safety is implied, missing, or contradictory after the fact. In practice, this is the test that catches the silent corruption bugs nobody wants to find in production.
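Since the harness refuses to produce a verdict unless all five scenarios are exercised, that precondition is easy to sketch. The scenario ids below come straight from the parity pack; the guard function itself is a hypothetical name, not OpenClaw's actual API:

```typescript
// First-wave scenario ids as listed in the parity pack.
const FIRST_WAVE = [
  "approval-turn-tool-followthrough",
  "model-switch-tool-continuity",
  "source-docs-discovery-report",
  "image-understanding-attachment",
  "compaction-retry-mutating-tool",
];

// Hypothetical guard: refuse to emit any verdict unless every
// first-wave scenario actually ran.
function requireFullCoverage(ran: string[]): void {
  const seen = new Set(ran);
  const missing = FIRST_WAVE.filter((id) => !seen.has(id));
  if (missing.length > 0) {
    throw new Error(`no verdict: scenarios not exercised: ${missing.join(", ")}`);
  }
}
```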

The Release Gate

The new program is also notable for what it refuses to accept as parity evidence. Shared CI noise outside the harness is explicitly not a parity result — if CI gets flaky for reasons unrelated to the harness, operators are told to wait for a clean merged-runtime execution rather than infer anything from branch-era logs. And the release claim is split across two independent layers: the parity pack proves same-scenario GPT-5.4 vs Opus 4.6 behavior in QA-lab, while a separate deterministic truthfulness suite proves that the auth, proxy, DNS, and elevated-permission plumbing is telling the truth. Both have to be green for a model to be called at parity.

That separation is a small piece of release discipline that pays off later. It means a regression in network-layer truthfulness cannot hide inside a passing parity run, and a regression in agentic behavior cannot be papered over with clean network tests. Each kind of failure has its own hard gate.

Before

  • GPT-5.4 could stop after a reasonable plan without taking the next tool step
  • Strict tool schemas rejected parameter-free or OpenAI/Codex-shaped tools in confusing ways
  • Elevated-permission guidance could be vague or wrong in blocked runtimes
  • Replay and compaction failures felt like tasks silently disappeared
  • “GPT-5.4 feels worse than Opus” was mostly anecdotal

After

  • Plan-only turns become either action or an explicit blocked state
  • Provider-owned tool registration and invocation are predictable again
  • Runtime and permission hints match what the runtime can actually do
  • Paused, blocked, abandoned, and replay-invalid outcomes all surface explicitly
  • Same scenario pack, same metrics, hard pass/fail gate

Who Should Care

The strict-agentic contract is recommended for anyone running GPT-5.4 or Codex-family models as a primary runtime where the agent is expected to act immediately once the next step is obvious. It is explicitly not recommended for people who prefer the existing looser behavior or for prompt-testing workflows where enforcement would get in the way. The other three slices — truthfulness, tool-compat, and replay/liveness — apply across the board.

For anyone maintaining their own OpenClaw fleet or evaluating which frontier model to pin as the default on their DeployClaw-hosted instance, the real win is that the answer to “is GPT-5.4 good enough yet” is no longer a matter of vibes. It is a JSON file, a gate result, and a Markdown report. If the runtime team ever wants to claim parity, the harness will now make them show their work.

The stated goal is not to make GPT-5.4 imitate Opus — the two models have genuinely different strengths, and the parity program is explicit about that. The goal is to give GPT-5.4 a runtime contract that rewards real progress, supplies cleaner tool and permission semantics, and turns failure modes into explicit machine- and human-readable states. The user experience changes from “the model had a good plan but stopped” to “the model either acted, or OpenClaw surfaced the exact reason it could not.”

That is a much better place to argue from.

DeployClaw News covers OpenClaw development weekly. DeployClaw hosts managed OpenClaw instances and ships upstream changes automatically. Editorial decisions are made independently. — C.S.