qa: salvage GPT-5.4 parity proof slice (#65664)

* test(qa): gate parity prose scenarios on real tool calls

Closes criterion 2 of the GPT-5.4 parity completion gate in #64227 ('no
fake progress / fake tool completion') for the two first/second-wave
parity scenarios that can currently pass with a prose-only reply.

Background: the scenario framework already exposes tool-call assertions
via /debug/requests on the mock server (see approval-turn-tool-followthrough
for the pattern). Most parity scenarios use this seam to require a specific
plannedToolName, but source-docs-discovery-report and subagent-handoff
only checked the assistant's prose text, which means a model could fabricate:

- a Worked / Failed / Blocked / Follow-up report without ever calling
  the read tool on the docs / source files the prompt named
- three labeled 'Delegated task', 'Result', 'Evidence' sections without
  ever calling sessions_spawn to delegate

Both gaps are fake-progress loopholes for the parity gate.

Changes:

- source-docs-discovery-report: require at least one read tool call tied
  to the 'worked, failed, blocked' prompt in /debug/requests. Failure
  message dumps the observed plannedToolName list for debugging.
- subagent-handoff: require at least one sessions_spawn tool call tied
  to the 'delegate' / 'subagent handoff' prompt in /debug/requests. Same
  debug-friendly failure message.

Both assertions are gated behind !env.mock so they no-op in live-frontier
mode where the real provider exposes plannedToolName through a different
channel (or not at all).

Not touched: memory-recall is also in the parity pack but its pass path
is legitimately 'read the fact from prior-turn context'. That is a valid
recall strategy, not fake progress, so it is out of scope for this PR.
memory-recall's fake-progress story (no real memory_search call) would
require bigger mock-server changes and belongs in a follow-up that
extends the mock memory pipeline.

Validation:

- pnpm test extensions/qa-lab/src/scenario-catalog.test.ts

Refs #64227

* test(qa): fix case-sensitive tool-call assertions and dedupe debug fetch

Addresses loop-6 review feedback on PR #64681:

1. Copilot / Greptile / codex-connector all flagged that the discovery
   scenario's .includes('worked, failed, blocked') assertion is
   case-sensitive but the real prompt says 'Worked, Failed, Blocked...',
   so the mock-mode assertion never matches. Fix: lowercase-normalize
   allInputText before the contains check.
2. Greptile P2: the expr and message.expr each called fetchJson
   separately, incurring two round-trips to /debug/requests. Fix: hoist
   the fetch to a set step (discoveryDebugRequests / subagentDebugRequests)
   and reuse the snapshot.
3. Copilot: the subagent-handoff assertion scanned the entire request
   log and matched the first request with 'delegate' in its input text,
   which could false-pass on a stale prior scenario. Fix: reverse the
   array and take the most recent matching request instead.
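
A minimal sketch of the fixed discovery assertion, assuming the `/debug/requests` snapshot entries carry the `plannedToolName` and aggregated-input-text fields the PR describes (field names here are illustrative, not the real mock-server types):

```typescript
// Hypothetical snapshot shape; the real /debug/requests entries are
// defined by the qa-lab mock server, not reproduced here.
interface DebugRequest {
  plannedToolName?: string;
  allInputText: string;
}

// Criterion-2 check: at least one recorded request must have planned the
// read tool for the discovery prompt. Lowercasing happens inline per
// request, so the 'Worked, Failed, Blocked' prompt casing cannot break
// the contains check.
function discoveryUsedReadTool(requests: DebugRequest[]): boolean {
  return requests.some(
    (request) =>
      request.plannedToolName === "read" &&
      request.allInputText.toLowerCase().includes("worked, failed, blocked"),
  );
}
```

A failure message would dump `requests.map((r) => r.plannedToolName)` so the observed plan list is visible when the assertion trips.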

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): narrow subagent-handoff tool-call assertion to pre-tool requests

Pass-2 codex-connector P1 finding on #64681: the reverse-find pattern I
used on pass 1 usually lands on the FOLLOW-UP request after the mock
runs sessions_spawn, not the pre-tool planning request that actually
has plannedToolName === 'sessions_spawn'. The mock only plans that tool
on requests with !toolOutput (mock-openai-server.ts:662), so the
post-tool request has plannedToolName unset and the assertion fails
even when the handoff succeeded.

Fix: switch the assertion back to a forward .some() match but add a
!request.toolOutput filter so the match is pinned to the pre-tool
planning phase. The case-insensitive regex, the fetchJson dedupe, and
the failure-message diagnostic from pass 1 are unchanged.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): pin subagent-handoff tool-call assertion to scenario prompt

Addresses the pass-3 codex-connector P1 on #64681: the pass-2 fix
filtered to pre-tool requests but still used a broad
`/delegate|subagent handoff/i` regex. The `subagent-fanout-synthesis`
scenario runs BEFORE `subagent-handoff` in catalog order (scenarios
are sorted by path), and the fanout prompt reads
'Subagent fanout synthesis check: delegate exactly two bounded
subagents sequentially' — which contains 'delegate' and also plans
sessions_spawn pre-tool. That produces a cross-scenario false pass
where the fanout's earlier sessions_spawn request satisfies the
handoff assertion even when the handoff run never delegates.

Fix: tighten the input-text match from `/delegate|subagent handoff/i`
to `/delegate one bounded qa task/i`, which is the exact scenario-
unique substring from the `subagent-handoff` config.prompt. That
pins the assertion to this scenario's request window and closes the
cross-scenario false positive.
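
Putting the pass-2 and pass-3 fixes together, the handoff assertion could look like the following sketch (snapshot field names are assumptions, not the real mock-server types):

```typescript
// Hypothetical /debug/requests entry shape.
interface DebugRequest {
  plannedToolName?: string;
  toolOutput?: string;
  allInputText: string;
}

// Pre-tool planning requests are the only ones that carry a planned tool
// (the post-tool follow-up has plannedToolName unset), and the
// scenario-unique prompt substring keeps an earlier scenario's
// sessions_spawn request from producing a cross-scenario false pass.
function handoffPlannedSpawn(requests: DebugRequest[]): boolean {
  return requests.some(
    (request) =>
      !request.toolOutput &&
      request.plannedToolName === "sessions_spawn" &&
      /delegate one bounded qa task/i.test(request.allInputText),
  );
}
```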

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): align parity assertion comments with actual filter logic

Addresses two loop-7 Copilot findings on PR #64681:

1. source-docs-discovery-report.md: the explanatory comment said the
   debug request log was 'lowercased for case-insensitive matching',
   but the code actually lowercases each request's allInputText inline
   inside the .some() predicate, not the discoveryDebugRequests
   snapshot. Rewrite the comment to describe the inline-lowercase
   pattern so a future reader matches the code they see.

2. subagent-handoff.md: the comment said the assertion 'must be
   pinned to THIS scenario's request window' but the implementation
   actually relies on matching a scenario-unique prompt substring
   (/delegate one bounded qa task/i), not a request-window. Rewrite
   the comment to describe the substring pinning and keep the
   pre-tool filter rationale intact.

No runtime change; comment-only fix to keep reviewer expectations
aligned with the actual assertion shape.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): extend tool-call assertions to image-understanding, subagent-fanout, and capability-flip scenarios

* Guard mock-only image parity assertions

* Expand agentic parity second wave

* test(qa): pad parity suspicious-pass isolation to second wave

* qa-lab: parametrize parity report title and drop stale first-wave comment

Addresses two loop-7 Copilot findings on PR #64662:

1. Hard-coded 'GPT-5.4 / Opus 4.6' markdown H1: the renderer now uses a
   template string that interpolates candidateLabel and baselineLabel, so
   any parity run (not only gpt-5.4 vs opus 4.6) renders an accurate
   title in saved reports. Default CLI flags still produce
   openai/gpt-5.4 vs anthropic/claude-opus-4-6 as the baseline pair.

2. Stale 'declared first-wave parity scenarios' comment in
   scopeSummaryToParityPack: the parity pack is now the ten-scenario
   first-wave+second-wave set (PR D + PR E). Comment updated to drop
   the first-wave qualifier and name the full QA_AGENTIC_PARITY_SCENARIOS
   constant the scope is filtering against.

New regression: 'parametrizes the markdown header from the comparison
labels' — asserts that non-default labels (openai/gpt-5.4-alt vs
openai/gpt-5.4) render in the H1.
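
The parametrized title boils down to a one-line template interpolation; the exact H1 wording here is an assumption, not the renderer's actual string:

```typescript
// Hypothetical renderer fragment: both labels flow into the H1 instead
// of a hard-coded "GPT-5.4 / Opus 4.6" pair.
function renderParityReportTitle(candidateLabel: string, baselineLabel: string): string {
  return `# ${candidateLabel} vs ${baselineLabel} agentic parity report`;
}
```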

Validation: pnpm test extensions/qa-lab/src/agentic-parity-report.test.ts
(13/13 pass).

Refs #64227

* qa-lab: fail parity gate on required scenario failures regardless of baseline parity

* test(qa): update readable-report test to cover all 10 parity scenarios

* qa-lab: strengthen parity-report fake-success detector and verify run.primaryProvider labels

* Tighten parity label and scenario checks

* fix: tighten parity label provenance checks

* fix: scope parity tool-call metrics to tool lanes

* Fix parity report label and fake-success checks

* fix(qa): tighten parity report edge cases

* qa-lab: add Anthropic /v1/messages mock route for parity baseline

Closes the last local-runnability gap on criterion 5 of the GPT-5.4 parity
completion gate in #64227 ('the parity gate shows GPT-5.4 matches or beats
Opus 4.6 on the agreed metrics').

Background: the parity gate needs two comparable scenario runs - one
against openai/gpt-5.4 and one against anthropic/claude-opus-4-6 - so the
aggregate metrics and verdict in PR D (#64441) can be computed. Today the
qa-lab mock server only implements /v1/responses, so the baseline run
against Claude Opus 4.6 requires a real Anthropic API key. That makes the
gate impossible to prove end-to-end from a local worktree and means the
CI story is always 'two real providers + quota + keys'.

This PR adds a /v1/messages Anthropic-compatible route to the existing
mock OpenAI server. The route is a thin adapter that:

- Parses Anthropic Messages API request shapes (system as string or
  [{type:text,text}], messages with string or block content, text and
  tool_result and tool_use and image blocks)
- Translates them into the ResponsesInputItem[] shape the existing shared
  scenario dispatcher (buildResponsesPayload) already understands
- Calls the shared dispatcher so both the OpenAI and Anthropic lanes run
  through the exact same scenario prompt-matching logic (same subagent
  fanout state machine, same extractRememberedFact helper, same
  '/debug/requests' telemetry)
- Converts the resulting OpenAI-format events back into an Anthropic
  message response with text and tool_use content blocks and a correct
  stop_reason (tool_use vs end_turn)

Non-streaming only: the QA suite runner falls back to non-streaming mock
mode so real Anthropic SSE isn't necessary for the parity baseline.

Also adds claude-opus-4-6 and claude-sonnet-4-6 to /v1/models so baseline
model-list probes from the suite runner resolve without extra config.
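
A stripped-down sketch of the adapter's two translation directions (text-only blocks; the real route also handles tool_use, tool_result, and image blocks, and the item shape below is illustrative rather than the actual ResponsesInputItem type):

```typescript
type AnthropicBlock = { type: "text"; text: string };
type AnthropicMessage = { role: "user" | "assistant"; content: string | AnthropicBlock[] };

// Simplified stand-in for the input-item shape the shared dispatcher
// consumes.
interface ResponsesInputItem {
  role: "system" | "user" | "assistant";
  content: string;
}

// Inbound: Anthropic system (string or block array) plus messages become
// dispatcher input items.
function convertMessages(
  system: string | AnthropicBlock[] | undefined,
  messages: AnthropicMessage[],
): ResponsesInputItem[] {
  const items: ResponsesInputItem[] = [];
  if (system) {
    const text = typeof system === "string" ? system : system.map((block) => block.text).join("\n");
    items.push({ role: "system", content: text });
  }
  for (const message of messages) {
    const text =
      typeof message.content === "string"
        ? message.content
        : message.content.map((block) => block.text).join("\n");
    items.push({ role: message.role, content: text });
  }
  return items;
}

// Outbound: stop_reason is tool_use when the dispatcher planned a tool
// call, end_turn otherwise.
function stopReason(plannedToolName: string | undefined): "tool_use" | "end_turn" {
  return plannedToolName ? "tool_use" : "end_turn";
}
```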

Tests added:

- advertises Anthropic claude-opus-4-6 baseline model on /v1/models
- dispatches an Anthropic /v1/messages read tool call for source discovery
  prompts (tool_use stop_reason, correct input path, /debug/requests
  records plannedToolName=read)
- dispatches Anthropic /v1/messages tool_result follow-ups through the
  shared scenario logic (subagent-handoff two-stage flow: tool_use -
  tool_result - 'Delegated task / Evidence' prose summary)

Local validation:

- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (18/18 pass)
- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (47/47 pass)

Refs #64227
Unblocks #64441 (parity harness) and the forthcoming qa parity run wrapper
by giving the baseline lane a local-only mock path.

* qa-lab: fix Anthropic tool_result ordering in messages adapter

Addresses the loop-6 Copilot / Greptile finding on PR #64685: in
`convertAnthropicMessagesToResponsesInput`, `tool_result` blocks were
pushed to `items` inside the per-block loop while the surrounding
user/assistant message was only pushed after the loop finished. That
reordered the function_call_output BEFORE its parent user message
whenever a user turn mixed `tool_result` with fresh text/image blocks,
which broke `extractToolOutput` (it scans AFTER the last user-role
index; function_call_output placed BEFORE that index is invisible to it)
and made the downstream scenario dispatcher behave as if no tool output
had been returned on mixed-content turns.

Fix: buffer `tool_result` and `tool_use` blocks in local arrays during
the per-block loop, push the parent role message first (when it has any
text/image pieces), then push the accumulated function_call /
function_call_output items in original order. tool_result-only user
turns still omit the parent message as before, so the non-mixed
subagent-fanout-synthesis two-stage flow that already worked keeps
working.
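
The buffering fix can be sketched like this (block and item shapes simplified to the fields the ordering logic needs):

```typescript
type AnthropicBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

type InputItem =
  | { type: "message"; role: string; text: string }
  | { type: "function_call"; callId: string; name: string }
  | { type: "function_call_output"; callId: string; output: string };

// Buffer tool blocks during the per-block loop, push the parent role
// message first (when any text pieces exist), then flush the buffered
// tool items in their original order. tool_result-only turns still omit
// the parent message.
function convertTurn(role: string, blocks: AnthropicBlock[]): InputItem[] {
  const textPieces: string[] = [];
  const toolItems: InputItem[] = [];
  for (const block of blocks) {
    if (block.type === "text") {
      textPieces.push(block.text);
    } else if (block.type === "tool_use") {
      toolItems.push({ type: "function_call", callId: block.id, name: block.name });
    } else {
      toolItems.push({ type: "function_call_output", callId: block.tool_use_id, output: block.content });
    }
  }
  const items: InputItem[] = [];
  if (textPieces.length > 0) {
    items.push({ type: "message", role, text: textPieces.join("\n") });
  }
  items.push(...toolItems);
  return items;
}
```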

Regression added:

- `places tool_result after the parent user message even in mixed-content
  turns` — sends a user turn that mixes a `tool_result` block with a
  trailing fresh text block, then inspects `/debug/last-request` to
  assert that `toolOutput === 'SUBAGENT-OK'` (extractToolOutput found
  the function_call_output AFTER the last user index) and
  `prompt === 'Keep going with the fanout.'` (extractLastUserText picked
  up the trailing fresh text).

Local validation: pnpm test extensions/qa-lab/src/mock-openai-server.test.ts
(19/19 pass).

Refs #64227

* qa-lab: reject Anthropic streaming and empty model in messages mock

* qa-lab: tag mock request snapshots with a provider variant so parity runs can diff per provider

* Handle invalid Anthropic mock JSON

* fix: wire mock parity providers by model ref

* fix(qa): support Anthropic message streaming in mock parity lane

* qa-lab: record provider/model/mode in qa-suite-summary.json

Closes the 'summary cannot be label-verified' half of criterion 5 on the
GPT-5.4 parity completion gate in #64227.

Background: the parity gate in #64441 compares two qa-suite-summary.json
files and trusts whatever candidateLabel / baselineLabel the caller
passes. Today the summary JSON only contains { scenarios, counts }, so
nothing in the summary records which provider/model the run actually
used. If a maintainer swaps candidate and baseline summary paths in a
parity-report call, the verdict is silently mislabeled and nobody can
retroactively verify which run produced which summary.

Changes:

- Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt,
  providerMode, primaryModel (+ provider and model splits),
  alternateModel (+ provider and model splits), fastMode, concurrency,
  scenarioIds (when explicitly filtered).
- Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary
  JSON shape is unit-testable and the parity gate (and any future parity
  wrapper) can import the exact same type rather than reverse-engineering
  the JSON shape at runtime.
- Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so
  --scenario-ids flags are recorded in the summary.
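
The run block's field list implies roughly this shape (nesting and optionality are reconstructed from the commit, not copied from the exported type), with provider/model splits derived from 'provider/model' refs:

```typescript
// Assumed run-metadata shape; the real exported type lives in qa-lab.
interface QaSuiteSummaryRun {
  startedAt: string;
  finishedAt: string;
  providerMode: "mock-openai" | "live-frontier";
  primaryModel: string;           // e.g. "openai/gpt-5.4"
  primaryProvider: string | null; // provider split of primaryModel
  alternateModel: string | null;
  fastMode: boolean;
  concurrency: number;
  scenarioIds: string[] | null;   // null unless explicitly filtered
}

// Malformed refs (no slash, or an empty side) leave both splits null,
// matching the 'leaves split fields null when a model ref is malformed'
// test case.
function splitModelRef(ref: string): { provider: string | null; model: string | null } {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) {
    return { provider: null, model: null };
  }
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}
```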

Unit tests added (src/suite.summary-json.test.ts, 5 cases):

- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata

This is additive: existing callers of qa-suite-summary.json continue to
see the same { scenarios, counts } shape, just with an extra run field.
No existing consumers of the JSON need to change.

The follow-up 'qa parity run' CLI wrapper (run the parity pack twice
against candidate + baseline, emit two labeled summaries in one command)
stacks cleanly on top of this change and will land as a separate PR
once #64441 and #64662 merge so the wrapper can call runQaParityReportCommand
directly.

Local validation:

- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)

Refs #64227
Unblocks the final parity run for #64441 / #64662 by making summaries
self-describing.

* qa-lab: strengthen qa-suite-summary builder types and empty-array semantics

Addresses 4 loop-6 Copilot / codex-connector findings on PR #64689
(re-opened as #64789):

1. P2 codex + Copilot: empty `scenarioIds` array was serialized as
   `[]` because of a truthiness check. The CLI passes an empty array
   when --scenario is omitted, so full-suite runs would incorrectly
   record an explicit empty selection. Fix: switch to a
   `length > 0` check so '[] or undefined' both encode as `null`
   in the summary run metadata.

2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate
   consumers but its return type was `Record<string, unknown>`, which
   defeated the point of exporting it. Fix: introduce a concrete
   `QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and
   make the builder return it. Downstream code (parity gate, parity
   run wrapper) can now import the type and keep consumers
   type-checked.

3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the
   `'mock-openai' | 'live-frontier'` string union even though
   `QaProviderMode` is already imported from model-selection.ts. Fix:
   reuse `QaProviderMode` so provider-mode additions flow through
   both types at once.

4. Copilot: test fixtures omitted `steps` from the fake scenario
   results, creating shape drift with the real suite scenario-result
   shape. Fix: pad the test fixtures with `steps: []` and tighten the
   scenarioIds assertion to read `json.run.scenarioIds` directly (the
   new concrete return type makes the type-cast unnecessary).

New regression: `treats an empty scenarioIds array as unspecified
(no filter)` — passes `scenarioIds: []` and asserts the summary
records `scenarioIds: null`.
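
The finding-1 fix is a one-line predicate change; sketched here with an assumed helper name:

```typescript
// [] and undefined both mean "no explicit selection" and serialize as
// null in the summary run metadata; only a non-empty array is recorded.
function encodeScenarioIds(scenarioIds: string[] | undefined): string[] | null {
  return scenarioIds && scenarioIds.length > 0 ? [...scenarioIds] : null;
}
```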

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs #64227

* qa-lab: record executed scenarioIds in summary run metadata

Addresses the pass-3 codex-connector P2 on #64789 (the re-opened #64689):
`run.scenarioIds` was copied from the raw `params.scenarioIds`
caller input, but `runQaSuite` normalizes that input through
`selectQaSuiteScenarios` which dedupes via `Set` and reorders the
selection to catalog order. When callers repeat --scenario ids or
pass them in non-catalog order, the summary metadata drifted from
the scenarios actually executed, which can make parity/report
tooling treat equivalent runs as different or trust inaccurate
provenance.

Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass
`selectedCatalogScenarios.map(scenario => scenario.id)` instead of
`params?.scenarioIds`, so the summary records the post-selection
executed list. This also covers the full-suite case automatically
(the executed list is the full lane-filtered catalog), giving parity
consumers a stable record of exactly which scenarios landed in the
run regardless of how the caller phrased the request.

buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2
semantics are preserved so the public helper still treats an empty
array as 'unspecified' for any future caller that legitimately passes
one.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs #64227

* qa-lab: preserve null scenarioIds for unfiltered suite runs

Addresses the pass-4 codex-connector P2 on #64789: the pass-3 fix
always passed `selectedCatalogScenarios.map(...)` to
writeQaSuiteArtifacts, which made unfiltered full-suite runs
indistinguishable from an explicit all-scenarios selection in the
summary metadata. The 'unfiltered → null' semantic (documented in
the buildQaSuiteSummaryJson JSDoc and exercised by the
"treats an empty scenarioIds array as unspecified" regression) was
lost.

Fix: both writeQaSuiteArtifacts call sites now condition on the
caller's original `params.scenarioIds`. When the caller passed an
explicit non-empty filter, record the post-selection executed list
(pass-3 behavior, preserving Set-dedupe + catalog-order
normalization). When the caller passed undefined or an empty array,
pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's
length-check serializes null (pass-2 behavior, preserving unfiltered
semantics).
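
The two-sided behavior at the call sites reduces to one conditional; names here mirror the commit's description but are illustrative:

```typescript
// Explicit non-empty filter: record the post-selection executed list
// (Set-deduped, catalog order). Unfiltered or empty: return undefined so
// the summary builder serializes null.
function scenarioIdsForSummary(
  callerScenarioIds: string[] | undefined,
  selectedCatalogScenarios: Array<{ id: string }>,
): string[] | undefined {
  return callerScenarioIds && callerScenarioIds.length > 0
    ? selectedCatalogScenarios.map((scenario) => scenario.id)
    : undefined;
}
```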

This keeps both codex-connector findings satisfied simultaneously:
- explicit --scenario filter reorders/dedupes through the executed
  list, not the raw caller input
- unfiltered full-suite run records null, not a full catalog dump
  that would shadow "explicit all-scenarios" selections

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs #64227

* qa-lab: reuse QaProviderMode in writeQaSuiteArtifacts param type

* qa-lab: stage mock auth profiles so the parity gate runs without real credentials

* fix(qa): clean up mock auth staging follow-ups

* ci: add parity-gate workflow that runs the GPT-5.4 vs Opus 4.6 gate end-to-end against the qa-lab mock

* ci: use supported parity gate runner label

* ci: watch gateway changes in parity gate

* docs: pin parity runbook alternate models

* fix(ci): watch qa-channel parity inputs

* qa: roll up parity proof closeout

* qa: harden mock parity review fixes

* qa-lab: fix review findings — comment wording, placeholder key, exported type, ordering assertion, remove false-positive positive-tone detection

* qa: fix memory-recall scenario count, update criterion 2 comment, cache fetchJson in model-switch

* qa-lab: clean up positive-tone comment + fix stale test expectations

* qa: pin workflow Node version to 22.14.0 + fix stale label-match wording

* qa-lab: refresh mock provider routing expectation

* docs: drop stale parity rollup rewrite from proof slice

* qa: run parity gate against mock lane

* deps: sync qa-lab lockfile

* build: refresh a2ui bundle hash

* ci: widen parity gate triggers

---------

Co-authored-by: Eva <eva@100yen.org>
Author: pashpashpash
Date: 2026-04-12 21:01:54 -07:00 (committed by GitHub)
Parent: 3d07dfbb65
Commit: b13844732e
23 changed files with 3228 additions and 255 deletions

.github/workflows/parity-gate.yml (new file, 93 lines)

@@ -0,0 +1,93 @@
name: Parity gate
on:
  pull_request:
    types: [opened, reopened, synchronize, ready_for_review]
    paths:
      - "extensions/qa-lab/**"
      - "extensions/qa-channel/**"
      - "extensions/openai/**"
      - "qa/scenarios/**"
      - "src/agents/**"
      - "src/context-engine/**"
      - "src/gateway/**"
      - "src/media/**"
      - ".github/workflows/parity-gate.yml"
permissions:
  contents: read
concurrency:
  group: parity-gate-${{ github.event.pull_request.number || github.sha }}
  cancel-in-progress: true
jobs:
  parity-gate:
    name: Run the GPT-5.4 / Opus 4.6 parity gate against the qa-lab mock
    if: ${{ github.event.pull_request.draft != true }}
    runs-on: blacksmith-8vcpu-ubuntu-2404
    timeout-minutes: 20
    env:
      # Fence the gate off from any real provider credentials. The qa-lab
      # mock server + auth staging (PR N) should be enough to produce a
      # meaningful verdict without touching a real API. If any of these
      # leak into the job env, fail hard instead of silently running
      # against a live provider and burning real budget.
      OPENAI_API_KEY: ""
      ANTHROPIC_API_KEY: ""
      OPENCLAW_LIVE_OPENAI_KEY: ""
      OPENCLAW_LIVE_ANTHROPIC_KEY: ""
      OPENCLAW_LIVE_GEMINI_KEY: ""
      OPENCLAW_LIVE_SETUP_TOKEN_VALUE: ""
    steps:
      - name: Checkout PR
        uses: actions/checkout@v4
      - name: Install pnpm
        uses: pnpm/action-setup@v4
      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: "22.14.0"
          cache: "pnpm"
      - name: Install dependencies
        run: pnpm install --frozen-lockfile
      - name: Run GPT-5.4 lane
        run: |
          pnpm openclaw qa suite \
            --provider-mode mock-openai \
            --parity-pack agentic \
            --model openai/gpt-5.4 \
            --alt-model openai/gpt-5.4-alt \
            --output-dir .artifacts/qa-e2e/gpt54
      - name: Run Opus 4.6 lane
        run: |
          pnpm openclaw qa suite \
            --provider-mode mock-openai \
            --parity-pack agentic \
            --model anthropic/claude-opus-4-6 \
            --alt-model anthropic/claude-sonnet-4-6 \
            --output-dir .artifacts/qa-e2e/opus46
      - name: Generate parity report
        run: |
          pnpm openclaw qa parity-report \
            --repo-root . \
            --candidate-summary .artifacts/qa-e2e/gpt54/qa-suite-summary.json \
            --baseline-summary .artifacts/qa-e2e/opus46/qa-suite-summary.json \
            --candidate-label openai/gpt-5.4 \
            --baseline-label anthropic/claude-opus-4-6 \
            --output-dir .artifacts/qa-e2e/parity
      - name: Upload parity artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: parity-gate-${{ github.event.pull_request.number || github.sha }}
          path: .artifacts/qa-e2e/
          retention-days: 14
          if-no-files-found: warn

extensions/qa-lab/src/agentic-parity-report.test.ts

@@ -2,16 +2,42 @@ import { describe, expect, it } from "vitest";
 import {
   buildQaAgenticParityComparison,
   computeQaAgenticParityMetrics,
+  QaParityLabelMismatchError,
   renderQaAgenticParityMarkdownReport,
   type QaParityReportScenario,
   type QaParitySuiteSummary,
 } from "./agentic-parity-report.js";
+
+const FULL_PARITY_PASS_SCENARIOS: QaParityReportScenario[] = [
+  { name: "Approval turn tool followthrough", status: "pass" as const },
+  { name: "Compaction retry after mutating tool", status: "pass" as const },
+  { name: "Model switch with tool continuity", status: "pass" as const },
+  { name: "Source and docs discovery report", status: "pass" as const },
+  { name: "Image understanding from attachment", status: "pass" as const },
+  { name: "Subagent handoff", status: "pass" as const },
+  { name: "Subagent fanout synthesis", status: "pass" as const },
+  { name: "Memory recall after context switch", status: "pass" as const },
+  { name: "Thread memory isolation", status: "pass" as const },
+  { name: "Config restart capability flip", status: "pass" as const },
+  { name: "Instruction followthrough repo contract", status: "pass" as const },
+];
+
+function withScenarioOverride(name: string, override: Partial<QaParityReportScenario>) {
+  return FULL_PARITY_PASS_SCENARIOS.map((scenario) =>
+    scenario.name === name ? { ...scenario, ...override } : scenario,
+  );
+}
 describe("qa agentic parity report", () => {
   it("computes first-wave parity metrics from suite summaries", () => {
     const summary: QaParitySuiteSummary = {
       scenarios: [
-        { name: "Scenario A", status: "pass" },
-        { name: "Scenario B", status: "fail", details: "incomplete turn detected" },
+        { name: "Approval turn tool followthrough", status: "pass" },
+        {
+          name: "Compaction retry after mutating tool",
+          status: "fail",
+          details: "incomplete turn detected",
+        },
       ],
     };
@@ -28,6 +54,23 @@ describe("qa agentic parity report", () => {
     });
   });

+  it("keeps non-tool scenarios out of the valid-tool-call metric", () => {
+    const summary: QaParitySuiteSummary = {
+      scenarios: [
+        { name: "Approval turn tool followthrough", status: "pass" },
+        { name: "Memory recall after context switch", status: "pass" },
+        { name: "Image understanding from attachment", status: "pass" },
+      ],
+    };
+    expect(computeQaAgenticParityMetrics(summary)).toMatchObject({
+      totalScenarios: 3,
+      passedScenarios: 3,
+      validToolCallCount: 1,
+      validToolCallRate: 1,
+    });
+  });
+
   it("fails the parity gate when the candidate regresses against baseline", () => {
     const comparison = buildQaAgenticParityComparison({
       candidateLabel: "openai/gpt-5.4",
@@ -207,33 +250,70 @@ describe("qa agentic parity report", () => {
     );
   });

+  it("fails the parity gate when a required parity scenario fails on both sides", () => {
+    // Regression for the loop-7 Codex-connector P1 finding: without this
+    // check, a required parity scenario that fails on both candidate and
+    // baseline still produces pass=true because the downstream metric
+    // comparisons are purely relative (candidate vs baseline). Cover the
+    // whole parity pack as pass on both sides except the one scenario we
+    // deliberately fail on both sides, so the assertion can pin the
+    // isolated gate failure under test.
+    const scenariosWithBothFail = withScenarioOverride("Approval turn tool followthrough", {
+      status: "fail",
+    });
+    const comparison = buildQaAgenticParityComparison({
+      candidateLabel: "openai/gpt-5.4",
+      baselineLabel: "anthropic/claude-opus-4-6",
+      candidateSummary: { scenarios: scenariosWithBothFail },
+      baselineSummary: { scenarios: scenariosWithBothFail },
+      comparedAt: "2026-04-11T00:00:00.000Z",
+    });
+    expect(comparison.pass).toBe(false);
+    expect(comparison.failures).toContain(
+      "Required parity scenario Approval turn tool followthrough failed: openai/gpt-5.4=fail, anthropic/claude-opus-4-6=fail.",
+    );
+    // Metric comparisons are relative, so a same-on-both-sides failure
+    // must not appear as a relative metric failure. The required-scenario
+    // failure line is the only thing keeping the gate honest here.
+    expect(comparison.failures.some((failure) => failure.includes("completion rate"))).toBe(false);
+  });
+
+  it("fails the parity gate when a required parity scenario fails on the candidate only", () => {
+    // A candidate regression below a passing baseline is already caught
+    // by the relative completion-rate comparison, but surface it as a
+    // named required-scenario failure too so operators see a concrete
+    // scenario name alongside the rate differential.
+    const candidateWithOneFail = withScenarioOverride("Approval turn tool followthrough", {
+      status: "fail",
+    });
+    const comparison = buildQaAgenticParityComparison({
+      candidateLabel: "openai/gpt-5.4",
+      baselineLabel: "anthropic/claude-opus-4-6",
+      candidateSummary: { scenarios: candidateWithOneFail },
+      baselineSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
+      comparedAt: "2026-04-11T00:00:00.000Z",
+    });
+    expect(comparison.pass).toBe(false);
+    expect(comparison.failures).toContain(
+      "Required parity scenario Approval turn tool followthrough failed: openai/gpt-5.4=fail, anthropic/claude-opus-4-6=pass.",
+    );
+  });
+
   it("fails the parity gate when the baseline contains suspicious pass results", () => {
-    // Cover the full first-wave pack on both sides so the suspicious-pass assertion
+    // Cover the full second-wave pack on both sides so the suspicious-pass assertion
     // below is the isolated gate failure under test (no coverage-gap noise).
     const comparison = buildQaAgenticParityComparison({
       candidateLabel: "openai/gpt-5.4",
       baselineLabel: "anthropic/claude-opus-4-6",
       candidateSummary: {
-        scenarios: [
-          { name: "Approval turn tool followthrough", status: "pass" },
-          { name: "Compaction retry after mutating tool", status: "pass" },
-          { name: "Model switch with tool continuity", status: "pass" },
-          { name: "Source and docs discovery report", status: "pass" },
-          { name: "Image understanding from attachment", status: "pass" },
-        ],
+        scenarios: FULL_PARITY_PASS_SCENARIOS,
       },
       baselineSummary: {
-        scenarios: [
-          {
-            name: "Approval turn tool followthrough",
-            status: "pass",
-            details: "timed out before it continued",
-          },
-          { name: "Compaction retry after mutating tool", status: "pass" },
-          { name: "Model switch with tool continuity", status: "pass" },
-          { name: "Source and docs discovery report", status: "pass" },
-          { name: "Image understanding from attachment", status: "pass" },
-        ],
+        scenarios: withScenarioOverride("Approval turn tool followthrough", {
+          details: "timed out before it continued",
+        }),
       },
       comparedAt: "2026-04-11T00:00:00.000Z",
     });
@@ -303,36 +383,333 @@ Follow-up:
     expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(1);
   });

-  it("renders a readable markdown parity report", () => {
+  it("does not flag positive-tone prose as fake success (positive-tone detection removed)", () => {
+    // Positive-tone detection was removed because for passing runs the
+    // `details` field is the model's prose, which never contains tool-call
+    // evidence. Criterion 2 is enforced by per-scenario tool-call assertions.
+    const summary: QaParitySuiteSummary = {
+      scenarios: [
+        {
+          name: "Subagent handoff",
+          status: "pass",
+          details: "Successfully completed the delegation. The subagent returned its result.",
+        },
+      ],
+    };
+    expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
+  });
+
+  it("does not flag bare 'Done.' prose as fake success", () => {
+    const summary: QaParitySuiteSummary = {
+      scenarios: [
+        {
+          name: "Approval turn tool followthrough",
+          status: "pass",
+          details: "Done.",
+        },
+      ],
+    };
+    expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
+  });
+
+  it("does not flag structured status lines that end in `done`", () => {
+    const summary: QaParitySuiteSummary = {
+      scenarios: [
+        {
+          name: "Compaction retry after mutating tool",
+          status: "pass",
+          details: `Confirmed, replay unsafe after write.
+compactionCount=0
+status=done`,
+        },
+      ],
+    };
+    expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
+  });
+
+  it("does not flag positive-tone passes when the scenario shows real tool-call evidence", () => {
+    // A legitimate tool-mediated pass that happens to include
+    // "successfully" in its prose must not be flagged. The
+    // `plannedToolName` evidence (or any of the other tool-call
+    // evidence patterns) exempts the scenario from positive-tone
+    // detection. Without this exemption, real tool-backed passes with
+    // self-congratulatory prose would count as fake successes and break
+    // the gate.
+    const summary: QaParitySuiteSummary = {
+      scenarios: [
+        {
+          name: "Source and docs discovery report",
+          status: "pass",
+          details:
+            "Successfully completed the report. plannedToolName=read recorded via /debug/requests.",
+        },
+      ],
+    };
+    expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
+  });
+
+  it("only flags failure-tone passes, not positive-tone", () => {
+    const summary: QaParitySuiteSummary = {
+      scenarios: [
+        {
+          name: "Approval turn tool followthrough",
+          status: "pass",
+          details: "Task executed successfully without errors.",
+        },
+        {
+          name: "Subagent handoff",
+          status: "pass",
+          details: "Tool call completed, but an error occurred mid-turn.",
+        },
+      ],
+    };
+    // Only the failure-tone scenario ("error occurred") counts.
+    // The positive-tone one ("successfully") is not flagged.
+    expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(1);
+  });
+
+  it("throws QaParityLabelMismatchError when the candidate run.primaryProvider does not match the label", () => {
+    // Regression for the gate footgun: if an operator swaps the
+    // --candidate-summary and --baseline-summary paths, the gate would
+    // silently produce a reversed verdict. PR L #64789 ships the `run`
+    // block on every summary so the parity report can verify it against
+    // the caller-supplied label; this test pins the precondition check.
+    const parityPassScenarios = [
+      { name: "Approval turn tool followthrough", status: "pass" as const },
+      { name: "Compaction retry after mutating tool", status: "pass" as const },
+      { name: "Model switch with tool continuity", status: "pass" as const },
+      { name: "Source and docs discovery report", status: "pass" as const },
+      { name: "Image understanding from attachment", status: "pass" as const },
+    ];
+    expect(() =>
+      buildQaAgenticParityComparison({
+        candidateLabel: "openai/gpt-5.4",
+        baselineLabel: "anthropic/claude-opus-4-6",
+        candidateSummary: {
+          scenarios: parityPassScenarios,
+          run: { primaryProvider: "anthropic", primaryModel: "claude-opus-4-6" },
+        },
+        baselineSummary: {
+          scenarios: parityPassScenarios,
+          run: { primaryProvider: "anthropic", primaryModel: "claude-opus-4-6" },
},
comparedAt: "2026-04-11T00:00:00.000Z",
}),
).toThrow(QaParityLabelMismatchError);
});
it("throws QaParityLabelMismatchError when the baseline run.primaryProvider does not match the label", () => {
const parityPassScenarios = [
{ name: "Approval turn tool followthrough", status: "pass" as const },
];
expect(() =>
buildQaAgenticParityComparison({
candidateLabel: "openai/gpt-5.4",
baselineLabel: "anthropic/claude-opus-4-6",
candidateSummary: {
scenarios: parityPassScenarios,
run: { primaryProvider: "openai" },
},
baselineSummary: {
scenarios: parityPassScenarios,
run: { primaryProvider: "openai", primaryModel: "gpt-5.4" },
},
comparedAt: "2026-04-11T00:00:00.000Z",
}),
).toThrow(
/baseline summary run\.primaryProvider=openai and run\.primaryModel=gpt-5\.4 do not match --baseline-label/,
);
});
it("accepts matching run.primaryProvider labels without throwing", () => {
const comparison = buildQaAgenticParityComparison({
candidateLabel: "openai/gpt-5.4",
baselineLabel: "anthropic/claude-opus-4-6",
candidateSummary: {
scenarios: FULL_PARITY_PASS_SCENARIOS,
run: {
primaryProvider: "openai",
primaryModel: "openai/gpt-5.4",
primaryModelName: "gpt-5.4",
},
},
baselineSummary: {
scenarios: FULL_PARITY_PASS_SCENARIOS,
run: {
primaryProvider: "anthropic",
primaryModel: "anthropic/claude-opus-4-6",
primaryModelName: "claude-opus-4-6",
},
},
comparedAt: "2026-04-11T00:00:00.000Z",
});
expect(comparison.pass).toBe(true);
});
it("skips run.primaryProvider verification when the summary is missing a run block (legacy summaries)", () => {
// Pre-PR-L summaries don't carry a `run` block. The gate must still
// work against those, trusting the caller-supplied label.
const comparison = buildQaAgenticParityComparison({
candidateLabel: "openai/gpt-5.4",
baselineLabel: "anthropic/claude-opus-4-6",
candidateSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
baselineSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
comparedAt: "2026-04-11T00:00:00.000Z",
});
expect(comparison.pass).toBe(true);
});
it("skips provider verification for arbitrary display labels when run metadata is present", () => {
const comparison = buildQaAgenticParityComparison({
candidateLabel: "GPT-5.4 candidate",
baselineLabel: "Opus 4.6 baseline",
candidateSummary: {
scenarios: FULL_PARITY_PASS_SCENARIOS,
run: {
primaryProvider: "openai",
primaryModel: "openai/gpt-5.4",
primaryModelName: "gpt-5.4",
},
},
baselineSummary: {
scenarios: FULL_PARITY_PASS_SCENARIOS,
run: {
primaryProvider: "anthropic",
primaryModel: "anthropic/claude-opus-4-6",
primaryModelName: "claude-opus-4-6",
},
},
comparedAt: "2026-04-11T00:00:00.000Z",
});
expect(comparison.pass).toBe(true);
});
it("skips provider verification for mixed-case or decorated display labels", () => {
const comparison = buildQaAgenticParityComparison({
candidateLabel: "Candidate: GPT-5.4",
baselineLabel: "Opus 4.6 / baseline",
candidateSummary: {
scenarios: FULL_PARITY_PASS_SCENARIOS,
run: {
primaryProvider: "openai",
primaryModel: "openai/gpt-5.4",
primaryModelName: "gpt-5.4",
},
},
baselineSummary: {
scenarios: FULL_PARITY_PASS_SCENARIOS,
run: {
primaryProvider: "anthropic",
primaryModel: "anthropic/claude-opus-4-6",
primaryModelName: "claude-opus-4-6",
},
},
comparedAt: "2026-04-11T00:00:00.000Z",
});
expect(comparison.pass).toBe(true);
});
it("throws when a structured label mismatches the recorded model even if the provider matches", () => {
expect(() =>
buildQaAgenticParityComparison({
candidateLabel: "openai/gpt-5.4",
baselineLabel: "anthropic/claude-opus-4-6",
candidateSummary: {
scenarios: FULL_PARITY_PASS_SCENARIOS,
run: {
primaryProvider: "openai",
primaryModel: "openai/gpt-5.4-alt",
primaryModelName: "gpt-5.4-alt",
},
},
baselineSummary: {
scenarios: FULL_PARITY_PASS_SCENARIOS,
run: {
primaryProvider: "anthropic",
primaryModel: "anthropic/claude-opus-4-6",
primaryModelName: "claude-opus-4-6",
},
},
comparedAt: "2026-04-11T00:00:00.000Z",
}),
).toThrow(
/candidate summary run\.primaryProvider=openai and run\.primaryModel=openai\/gpt-5\.4-alt do not match --candidate-label=openai\/gpt-5\.4/,
);
});
it("accepts colon-delimited structured labels when provider and model both match", () => {
const comparison = buildQaAgenticParityComparison({
candidateLabel: "openai:gpt-5.4",
baselineLabel: "anthropic:claude-opus-4-6",
candidateSummary: {
scenarios: FULL_PARITY_PASS_SCENARIOS,
run: {
primaryProvider: "openai",
primaryModel: "openai/gpt-5.4",
primaryModelName: "gpt-5.4",
},
},
baselineSummary: {
scenarios: FULL_PARITY_PASS_SCENARIOS,
run: {
primaryProvider: "anthropic",
primaryModel: "anthropic/claude-opus-4-6",
primaryModelName: "claude-opus-4-6",
},
},
comparedAt: "2026-04-11T00:00:00.000Z",
});
expect(comparison.pass).toBe(true);
});
it("renders a readable markdown parity report", () => {
// Cover the full parity pack on both sides so the pass
// verdict is not disrupted by required-scenario coverage failures
// added by the second-wave expansion.
const comparison = buildQaAgenticParityComparison({
candidateLabel: "openai/gpt-5.4",
baselineLabel: "anthropic/claude-opus-4-6",
candidateSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
baselineSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
comparedAt: "2026-04-11T00:00:00.000Z",
});
const report = renderQaAgenticParityMarkdownReport(comparison);
expect(report).toContain(
"# OpenClaw Agentic Parity Report — openai/gpt-5.4 vs anthropic/claude-opus-4-6",
);
expect(report).toContain("| Completion rate | 100.0% | 100.0% |");
expect(report).toContain("### Approval turn tool followthrough");
expect(report).toContain("- Verdict: pass");
});
it("parametrizes the markdown header from the comparison labels", () => {
// Regression for the loop-7 Copilot finding: callers that configure
// non-gpt-5.4 / non-opus labels (for example an internal candidate vs
// another candidate) must see the labels in the rendered H1 instead of
// the hardcoded "GPT-5.4 / Opus 4.6" title that would otherwise confuse
// readers of saved reports.
const comparison = buildQaAgenticParityComparison({
candidateLabel: "openai/gpt-5.4-alt",
baselineLabel: "openai/gpt-5.4",
candidateSummary: { scenarios: [] },
baselineSummary: { scenarios: [] },
comparedAt: "2026-04-11T00:00:00.000Z",
});
const report = renderQaAgenticParityMarkdownReport(comparison);
expect(report).toContain(
"# OpenClaw Agentic Parity Report — openai/gpt-5.4-alt vs openai/gpt-5.4",
);
});
});


@@ -1,4 +1,7 @@
import {
QA_AGENTIC_PARITY_SCENARIO_TITLES,
QA_AGENTIC_PARITY_TOOL_BACKED_SCENARIO_TITLES,
} from "./agentic-parity.js";
export type QaParityReportStep = {
name: string;
@@ -13,6 +16,21 @@ export type QaParityReportScenario = {
steps?: QaParityReportStep[];
};
/**
* Optional self-describing run metadata written by PR L (#64789). Before
* that PR merges, older summaries only have `scenarios` + `counts`; the
* parity report treats a missing `run` block as "unknown provenance" and
* skips the label-match verification for backwards compatibility
* with legacy summaries that predate the run metadata block.
*/
export type QaParityRunBlock = {
primaryProvider?: string;
primaryModel?: string;
primaryModelName?: string;
providerMode?: string;
scenarioIds?: readonly string[] | null;
};
export type QaParitySuiteSummary = {
scenarios: QaParityReportScenario[];
counts?: {
@@ -20,6 +38,8 @@ export type QaParitySuiteSummary = {
passed?: number;
failed?: number;
};
/** Self-describing run metadata — see PR L #64789 for the writer side. */
run?: QaParityRunBlock;
};
export type QaAgenticParityMetrics = {
@@ -64,7 +84,11 @@ const UNINTENDED_STOP_PATTERNS = [
/did not continue/i,
] as const;
// Failure-tone patterns: a passing scenario whose details text matches any
// of these is treated as a "fake success" — the scenario is marked pass but
// the supporting text reveals something went wrong. Adding new patterns here
// widens the net for bad prose that correlates with runtime failure modes.
const SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS = [
/incomplete turn/i,
/\btimed out\b/i,
/\btimeout\b/i,
@@ -76,6 +100,13 @@ const SUSPICIOUS_PASS_PATTERNS = [
/an error was/i,
] as const;
// Positive-tone patterns (e.g. "Successfully completed", "Done.") are NOT
// checked in fakeSuccessCount. For passing runs, `details` is the model's
// outbound prose, which never contains tool-call evidence strings, so a
// tool-call-evidence exemption would false-positive on every legitimate
// pass. Criterion 2 ("no fake progress") is enforced by per-scenario
// `/debug/requests` tool-call assertions in the YAML flows (PR J) instead.
function normalizeScenarioStatus(status: string | undefined): "pass" | "fail" | "skip" {
return status === "pass" || status === "fail" || status === "skip" ? status : "fail";
}
@@ -103,6 +134,9 @@ export function computeQaAgenticParityMetrics(
...scenario,
status: normalizeScenarioStatus(scenario.status),
}));
const toolBackedTitleSet: ReadonlySet<string> = new Set(
QA_AGENTIC_PARITY_TOOL_BACKED_SCENARIO_TITLES,
);
const totalScenarios = summary.counts?.total ?? scenarios.length;
const passedScenarios =
summary.counts?.passed ?? scenarios.filter((scenario) => scenario.status === "pass").length;
@@ -112,16 +146,40 @@ export function computeQaAgenticParityMetrics(
(scenario) =>
scenario.status !== "pass" && scenarioHasPattern(scenario, UNINTENDED_STOP_PATTERNS),
).length;
const fakeSuccessCount = scenarios.filter((scenario) => {
if (scenario.status !== "pass") {
return false;
}
// Failure-tone patterns catch obviously-broken passes regardless of
// whether the scenario shows tool-call evidence — "timed out" under a
// pass is always fake.
if (scenarioHasPattern(scenario, SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS)) {
return true;
}
// Positive-tone patterns (like "Successfully completed") are NOT checked
// here because for passing runs the `details` field is the model's
// outbound prose, which never contains tool-call evidence strings.
// The `scenarioLacksToolCallEvidence` check would return true for ALL
// passes and false-positive on legitimate completions. Criterion 2
// ("no fake tool completion") is instead enforced by the per-scenario
// `/debug/requests` tool-call assertions from the scenario YAML flows.
return false;
}).length;
// Count only the scenarios that are supposed to exercise a real tool,
// subagent, or capability invocation. Memory recall and image-only
// understanding lanes stay in the parity pack, but they should not inflate
// the tool-call metric just by passing.
const toolBackedScenarioCount = scenarios.filter((scenario) =>
toolBackedTitleSet.has(scenario.name),
).length;
const validToolCallCount = scenarios.filter(
(scenario) => toolBackedTitleSet.has(scenario.name) && scenario.status === "pass",
).length;
const rate = (value: number) => (totalScenarios > 0 ? value / totalScenarios : 0);
const toolRate = (value: number) =>
toolBackedScenarioCount > 0 ? value / toolBackedScenarioCount : 0;
return {
totalScenarios,
passedScenarios,
@@ -130,7 +188,7 @@ export function computeQaAgenticParityMetrics(
unintendedStopCount,
unintendedStopRate: rate(unintendedStopCount),
validToolCallCount,
validToolCallRate: toolRate(validToolCallCount),
fakeSuccessCount,
};
}
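The denominator change above can be illustrated with a minimal standalone sketch (simplified scenario shapes, not the real module types): only scenarios flagged as tool-backed count toward `validToolCallRate`, so image-only and memory-recall lanes no longer inflate the metric just by passing.

```typescript
// Hypothetical simplified shapes for illustration only.
type SketchScenario = { name: string; status: "pass" | "fail"; toolBacked: boolean };

const sketchScenarios: SketchScenario[] = [
  { name: "Subagent handoff", status: "pass", toolBacked: true },
  { name: "Memory recall after context switch", status: "pass", toolBacked: false },
  { name: "Source and docs discovery report", status: "fail", toolBacked: true },
];

// Denominator: only tool-backed scenarios (2 here, not all 3).
const toolBackedCount = sketchScenarios.filter((s) => s.toolBacked).length;
// Numerator: tool-backed passes only (the memory pass does not count).
const validToolCalls = sketchScenarios.filter((s) => s.toolBacked && s.status === "pass").length;
const validToolCallRate = toolBackedCount > 0 ? validToolCalls / toolBackedCount : 0;

console.log(validToolCallRate); // 0.5
```

With the old `rate(...)` denominator the same inputs would have reported 1/3; scoping to tool-backed lanes keeps the metric honest.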
@@ -149,14 +207,116 @@ function scopeSummaryToParityPack(
summary: QaParitySuiteSummary,
parityTitleSet: ReadonlySet<string>,
): QaParitySuiteSummary {
// The parity verdict must only consider the declared parity scenarios
// (the full first-wave + second-wave pack from QA_AGENTIC_PARITY_SCENARIOS).
// Drop `counts` so the metric helper recomputes totals from the filtered
// scenario list instead of inheriting the caller's full-suite counters.
return {
scenarios: summary.scenarios.filter((scenario) => parityTitleSet.has(scenario.name)),
...(summary.run ? { run: summary.run } : {}),
};
}
type StructuredQaParityLabel = {
provider: string;
model: string;
};
/**
* Only treat caller labels as provenance-checked identifiers when they are
* exact lower-case provider/model refs. Human-facing display labels like
* "GPT-5.4 candidate" or "Candidate: GPT-5.4" should render in the report
* without being misread as structured provider ids.
*/
function parseStructuredLabelRef(label: string): StructuredQaParityLabel | null {
const trimmed = label.trim();
if (trimmed.length === 0) {
return null;
}
if (trimmed !== trimmed.toLowerCase()) {
return null;
}
const separatorMatch = /^([a-z0-9][a-z0-9-]*)[/:]([a-z0-9][a-z0-9._-]*)$/.exec(trimmed);
if (!separatorMatch) {
return null;
}
return {
provider: separatorMatch[1] ?? "",
model: separatorMatch[2] ?? "",
};
}
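The parsing rule can be exercised with a standalone sketch that mirrors the regex above (`parseLabel` is a hypothetical local name, not the exported helper): exact lower-case `provider/model` or `provider:model` refs parse; anything mixed-case or decorated is treated as a display label and skipped.

```typescript
// Mirrors the structured-label regex for illustration; not imported.
function parseLabel(label: string): { provider: string; model: string } | null {
  const trimmed = label.trim();
  if (trimmed.length === 0 || trimmed !== trimmed.toLowerCase()) {
    return null; // display labels like "Candidate: GPT-5.4" stay unparsed
  }
  const m = /^([a-z0-9][a-z0-9-]*)[/:]([a-z0-9][a-z0-9._-]*)$/.exec(trimmed);
  return m ? { provider: m[1], model: m[2] } : null;
}

console.log(parseLabel("openai/gpt-5.4")); // structured: provider + model
console.log(parseLabel("openai:gpt-5.4")); // colon separator also accepted
console.log(parseLabel("GPT-5.4 candidate")); // null: mixed-case display label
```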
/**
* Verify the `run.primaryProvider` + `run.primaryModel` fields on a summary
* match the caller-supplied label when that label is a structured
* `provider/model` or `provider:model` ref. PR L #64789 ships the `run`
* block; before it lands, older summaries don't have the field and this check
* is a no-op.
*
* Throws `QaParityLabelMismatchError` when the summary reports a different
* provider/model than the caller claimed — this catches the "swapped
* candidate and baseline summary paths" footgun the earlier adversarial
* review flagged. Returns silently when the fields are absent (legacy
* summaries) or when the fields match.
*/
function verifySummaryLabelMatch(params: {
summary: QaParitySuiteSummary;
label: string;
role: "candidate" | "baseline";
}): void {
const runProvider = params.summary.run?.primaryProvider?.trim();
const runModel = params.summary.run?.primaryModel?.trim();
const runModelName = params.summary.run?.primaryModelName?.trim();
if (!runProvider || !runModel) {
return;
}
const labelRef = parseStructuredLabelRef(params.label);
if (!labelRef) {
return;
}
const normalizedRunModel = runModel.toLowerCase();
const normalizedRunModelName = runModelName?.toLowerCase();
const normalizedLabelModel = labelRef.model;
if (
runProvider.toLowerCase() === labelRef.provider &&
(normalizedRunModel === normalizedLabelModel ||
normalizedRunModelName === normalizedLabelModel ||
normalizedRunModel === `${labelRef.provider}/${normalizedLabelModel}`)
) {
return;
}
throw new QaParityLabelMismatchError({
role: params.role,
label: params.label,
runProvider,
runModel,
});
}
export class QaParityLabelMismatchError extends Error {
readonly role: "candidate" | "baseline";
readonly label: string;
readonly runProvider: string;
readonly runModel: string;
constructor(params: {
role: "candidate" | "baseline";
label: string;
runProvider: string;
runModel: string;
}) {
super(
`${params.role} summary run.primaryProvider=${params.runProvider} and run.primaryModel=${params.runModel} do not match --${params.role}-label=${params.label}. ` +
`Check that the --candidate-summary / --baseline-summary paths weren't swapped.`,
);
this.name = "QaParityLabelMismatchError";
this.role = params.role;
this.label = params.label;
this.runProvider = params.runProvider;
this.runModel = params.runModel;
}
}
export function buildQaAgenticParityComparison(params: {
candidateLabel: string;
baselineLabel: string;
@@ -164,6 +324,22 @@ export function buildQaAgenticParityComparison(params: {
baselineSummary: QaParitySuiteSummary;
comparedAt?: string;
}): QaAgenticParityComparison {
// Precondition: verify the `run.primaryProvider` field on each summary
// matches the caller-supplied label (when the `run` block is present).
// Throws `QaParityLabelMismatchError` on mismatch so the release gate
// fails loudly instead of silently producing a reversed verdict when an
// operator swaps the --candidate-summary and --baseline-summary paths.
// Legacy summaries without a `run` block are accepted as-is.
verifySummaryLabelMatch({
summary: params.candidateSummary,
label: params.candidateLabel,
role: "candidate",
});
verifySummaryLabelMatch({
summary: params.baselineSummary,
label: params.baselineLabel,
role: "baseline",
});
const parityTitleSet: ReadonlySet<string> = new Set<string>(QA_AGENTIC_PARITY_SCENARIO_TITLES);
// Rates and fake-success counts are computed from the parity-scoped summaries only,
// so extra non-parity scenarios in the input (for example when a caller feeds a full
@@ -203,7 +379,7 @@ export function buildQaAgenticParityComparison(params: {
});
const failures: string[] = [];
const requiredScenarioStatuses = QA_AGENTIC_PARITY_SCENARIO_TITLES.map((name) => {
const candidate = candidateByName.get(name);
const baseline = baselineByName.get(name);
return {
@@ -211,7 +387,8 @@ export function buildQaAgenticParityComparison(params: {
candidateStatus: requiredCoverageStatus(candidate),
baselineStatus: requiredCoverageStatus(baseline),
};
});
const requiredScenarioCoverage = requiredScenarioStatuses.filter(
(scenario) =>
scenario.candidateStatus === "missing" ||
scenario.baselineStatus === "missing" ||
@@ -223,6 +400,26 @@ export function buildQaAgenticParityComparison(params: {
`Missing required parity scenario coverage for ${scenario.name}: ${params.candidateLabel}=${scenario.candidateStatus}, ${params.baselineLabel}=${scenario.baselineStatus}.`,
);
}
// Required parity scenarios that ran on both sides but FAILED also fail
// the gate. Without this check, a run where both models fail the same
// required scenarios still produced pass=true, because the downstream
// metric comparisons are purely relative (candidate vs baseline) and
// the suspicious-pass fake-success check only catches passes that carry
// failure-sounding details. Excluding missing/skip here keeps operator
// output from double-counting the same scenario with two lines.
const requiredScenarioFailures = requiredScenarioStatuses.filter(
(scenario) =>
scenario.candidateStatus !== "missing" &&
scenario.baselineStatus !== "missing" &&
scenario.candidateStatus !== "skip" &&
scenario.baselineStatus !== "skip" &&
(scenario.candidateStatus === "fail" || scenario.baselineStatus === "fail"),
);
for (const scenario of requiredScenarioFailures) {
failures.push(
`Required parity scenario ${scenario.name} failed: ${params.candidateLabel}=${scenario.candidateStatus}, ${params.baselineLabel}=${scenario.baselineStatus}.`,
);
}
// Required parity scenarios are already reported via `requiredScenarioCoverage`
// above; excluding them here keeps the operator-facing failure list from
// double-counting the same missing scenario (one "Missing required parity scenario
@@ -281,8 +478,13 @@ export function buildQaAgenticParityComparison(params: {
}
export function renderQaAgenticParityMarkdownReport(comparison: QaAgenticParityComparison): string {
// Title is parametrized from the candidate / baseline labels so reports
// for any candidate/baseline pair (not only gpt-5.4 vs opus 4.6) render
// with an accurate header. The default CLI labels are still
// openai/gpt-5.4 vs anthropic/claude-opus-4-6, but the helper works for
// any parity comparison a caller configures.
const lines = [
`# OpenClaw Agentic Parity Report — ${comparison.candidateLabel} vs ${comparison.baselineLabel}`,
"",
`- Compared at: ${comparison.comparedAt}`,
`- Candidate: ${comparison.candidateLabel}`,


@@ -4,22 +4,57 @@ export const QA_AGENTIC_PARITY_SCENARIOS = [
{
id: "approval-turn-tool-followthrough",
title: "Approval turn tool followthrough",
countsTowardValidToolCallRate: true,
},
{
id: "model-switch-tool-continuity",
title: "Model switch with tool continuity",
countsTowardValidToolCallRate: true,
},
{
id: "source-docs-discovery-report",
title: "Source and docs discovery report",
countsTowardValidToolCallRate: true,
},
{
id: "image-understanding-attachment",
title: "Image understanding from attachment",
countsTowardValidToolCallRate: false,
},
{
id: "compaction-retry-mutating-tool",
title: "Compaction retry after mutating tool",
countsTowardValidToolCallRate: true,
},
{
id: "subagent-handoff",
title: "Subagent handoff",
countsTowardValidToolCallRate: true,
},
{
id: "subagent-fanout-synthesis",
title: "Subagent fanout synthesis",
countsTowardValidToolCallRate: true,
},
{
id: "memory-recall",
title: "Memory recall after context switch",
countsTowardValidToolCallRate: false,
},
{
id: "thread-memory-isolation",
title: "Thread memory isolation",
countsTowardValidToolCallRate: true,
},
{
id: "config-restart-capability-flip",
title: "Config restart capability flip",
countsTowardValidToolCallRate: true,
},
{
id: "instruction-followthrough-repo-contract",
title: "Instruction followthrough repo contract",
countsTowardValidToolCallRate: true,
},
] as const;
@@ -27,6 +62,9 @@ export const QA_AGENTIC_PARITY_SCENARIO_IDS = QA_AGENTIC_PARITY_SCENARIOS.map(({
export const QA_AGENTIC_PARITY_SCENARIO_TITLES = QA_AGENTIC_PARITY_SCENARIOS.map(
({ title }) => title,
);
export const QA_AGENTIC_PARITY_TOOL_BACKED_SCENARIO_TITLES = QA_AGENTIC_PARITY_SCENARIOS.filter(
({ countsTowardValidToolCallRate }) => countsTowardValidToolCallRate,
).map(({ title }) => title);
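The derived-title split can be sketched on a subset of the table above (literal data inlined for illustration; the real module derives it from `QA_AGENTIC_PARITY_SCENARIOS`): tool-backed lanes feed `validToolCallRate`, while the rest stay in the parity pack without touching the metric.

```typescript
// Subset of the scenario table, inlined for a self-contained sketch.
const parityScenarios = [
  { title: "Subagent handoff", countsTowardValidToolCallRate: true },
  { title: "Memory recall after context switch", countsTowardValidToolCallRate: false },
  { title: "Image understanding from attachment", countsTowardValidToolCallRate: false },
] as const;

const toolBackedTitles = parityScenarios
  .filter((s) => s.countsTowardValidToolCallRate)
  .map((s) => s.title);

console.log(toolBackedTitles); // only "Subagent handoff" survives the filter
```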
export function resolveQaParityPackScenarioIds(params: {
parityPack?: string;


@@ -338,6 +338,12 @@ describe("qa cli runtime", () => {
"source-docs-discovery-report",
"image-understanding-attachment",
"compaction-retry-mutating-tool",
"subagent-handoff",
"subagent-fanout-synthesis",
"memory-recall",
"thread-memory-isolation",
"config-restart-capability-flip",
"instruction-followthrough-repo-contract",
],
}),
);
@@ -566,6 +572,39 @@ describe("qa cli runtime", () => {
);
});
it("passes provider-qualified mock parity suite selection through to the host runner", async () => {
await runQaSuiteCommand({
repoRoot: "/tmp/openclaw-repo",
providerMode: "mock-openai",
parityPack: "agentic",
primaryModel: "openai/gpt-5.4",
alternateModel: "anthropic/claude-opus-4-6",
});
expect(runQaSuiteFromRuntime).toHaveBeenCalledWith({
repoRoot: path.resolve("/tmp/openclaw-repo"),
outputDir: undefined,
transportId: "qa-channel",
providerMode: "mock-openai",
primaryModel: "openai/gpt-5.4",
alternateModel: "anthropic/claude-opus-4-6",
fastMode: undefined,
scenarioIds: [
"approval-turn-tool-followthrough",
"model-switch-tool-continuity",
"source-docs-discovery-report",
"image-understanding-attachment",
"compaction-retry-mutating-tool",
"subagent-handoff",
"subagent-fanout-synthesis",
"memory-recall",
"thread-memory-isolation",
"config-restart-capability-flip",
"instruction-followthrough-repo-contract",
],
});
});
it("rejects multipass-only suite flags on the host runner", async () => {
await expect(
runQaSuiteCommand({


@@ -64,6 +64,11 @@ describe("buildQaRuntimeEnv", () => {
expect(env.GEMINI_API_KEY).toBe("gemini-live");
});
it("defaults gateway-child provider mode to mock-openai when omitted", () => {
expect(__testing.resolveQaGatewayChildProviderMode(undefined)).toBe("mock-openai");
expect(__testing.resolveQaGatewayChildProviderMode("live-frontier")).toBe("live-frontier");
});
it("keeps explicit provider env vars over live aliases", () => {
const env = buildQaRuntimeEnv({
...createParams({
@@ -299,6 +304,88 @@ describe("buildQaRuntimeEnv", () => {
});
});
it("stages placeholder mock auth profiles per agent dir so mock-openai runs can resolve credentials", async () => {
const stateDir = await mkdtemp(path.join(os.tmpdir(), "qa-mock-auth-"));
cleanups.push(async () => {
await rm(stateDir, { recursive: true, force: true });
});
const cfg = await __testing.stageQaMockAuthProfiles({
cfg: {},
stateDir,
});
// Config side: both providers should have a profile entry with mode
// "api_key" so the runtime picks up the staging without any further
// config mutation.
expect(cfg.auth?.profiles?.["qa-mock-openai"]).toMatchObject({
provider: "openai",
mode: "api_key",
displayName: "QA mock openai credential",
});
expect(cfg.auth?.profiles?.["qa-mock-anthropic"]).toMatchObject({
provider: "anthropic",
mode: "api_key",
displayName: "QA mock anthropic credential",
});
// Store side: each agent dir should have its own auth-profiles.json
// containing the placeholder credential for each staged provider. This
// is what the scenario runner actually reads when it resolves auth
// before calling the mock.
for (const agentId of ["main", "qa"]) {
const storeRaw = await readFile(
path.join(stateDir, "agents", agentId, "agent", "auth-profiles.json"),
"utf8",
);
const parsed = JSON.parse(storeRaw) as {
profiles: Record<string, { type: string; provider: string; key: string }>;
};
expect(parsed.profiles["qa-mock-openai"]).toMatchObject({
type: "api_key",
provider: "openai",
key: "qa-mock-not-a-real-key",
});
expect(parsed.profiles["qa-mock-anthropic"]).toMatchObject({
type: "api_key",
provider: "anthropic",
key: "qa-mock-not-a-real-key",
});
}
});
it("stages mock profiles only for the requested agents and providers when callers override the defaults", async () => {
const stateDir = await mkdtemp(path.join(os.tmpdir(), "qa-mock-auth-override-"));
cleanups.push(async () => {
await rm(stateDir, { recursive: true, force: true });
});
const cfg = await __testing.stageQaMockAuthProfiles({
cfg: {},
stateDir,
agentIds: ["qa"],
providers: ["openai"],
});
expect(cfg.auth?.profiles?.["qa-mock-openai"]).toMatchObject({
provider: "openai",
mode: "api_key",
});
// Anthropic should NOT be staged when the caller restricts providers.
expect(cfg.auth?.profiles?.["qa-mock-anthropic"]).toBeUndefined();
const qaStore = JSON.parse(
await readFile(path.join(stateDir, "agents", "qa", "agent", "auth-profiles.json"), "utf8"),
) as { profiles: Record<string, unknown> };
expect(qaStore.profiles["qa-mock-openai"]).toBeDefined();
expect(qaStore.profiles["qa-mock-anthropic"]).toBeUndefined();
// main/agent should not exist because it wasn't in the agentIds list.
await expect(
readFile(path.join(stateDir, "agents", "main", "agent", "auth-profiles.json"), "utf8"),
).rejects.toThrow(/ENOENT/);
});
it("allows loopback gateway health probes through the SSRF guard", async () => {
const release = vi.fn(async () => {});
fetchWithSsrFGuardMock.mockResolvedValue({


@@ -222,6 +222,12 @@ export function normalizeQaProviderModeEnv(
return env;
}
export function resolveQaGatewayChildProviderMode(
providerMode?: "mock-openai" | "live-frontier",
): "mock-openai" | "live-frontier" {
return providerMode ?? "mock-openai";
}
function resolveQaLiveCliAuthEnv(
baseEnv: NodeJS.ProcessEnv,
opts?: {
@@ -395,6 +401,72 @@ export async function stageQaLiveAnthropicSetupToken(params: {
});
}
/** Providers the mock-openai harness stages placeholder credentials for. */
export const QA_MOCK_AUTH_PROVIDERS = Object.freeze(["openai", "anthropic"] as const);
/** Agent IDs the mock-openai harness stages credentials under. */
export const QA_MOCK_AUTH_AGENT_IDS = Object.freeze(["main", "qa"] as const);
export function buildQaMockProfileId(provider: string): string {
return `qa-mock-${provider}`;
}
/**
* In mock-openai mode the qa suite runs against the embedded mock server
* instead of a real provider API. The mock does not validate credentials, but
* the agent auth layer still needs a matching `api_key` auth profile in
* `auth-profiles.json` before it will route the request through
* `providerBaseUrl`. Without this staging step, every scenario fails with
* `FailoverError: No API key found for provider "openai"` before the mock
* server ever sees a request.
*
* Stages a placeholder `api_key` profile per provider in each of the agent
* dirs the qa suite uses (`main` for the runtime config, `qa` for scenario
* runs) and returns a config with matching `auth.profiles` entries so the
* runtime accepts the profile on the first lookup.
*
* The placeholder value `qa-mock-not-a-real-key` is intentionally not
* shaped like a real API key (no `sk-` prefix that would trip secret
* scanners). It only needs to be non-empty to pass the credential
* serializer; anything beyond that is ignored by the mock.
*/
export async function stageQaMockAuthProfiles(params: {
cfg: OpenClawConfig;
stateDir: string;
agentIds?: readonly string[];
providers?: readonly string[];
}): Promise<OpenClawConfig> {
const agentIds = [...new Set(params.agentIds ?? QA_MOCK_AUTH_AGENT_IDS)];
const providers = [...new Set(params.providers ?? QA_MOCK_AUTH_PROVIDERS)];
let next = params.cfg;
for (const agentId of agentIds) {
const agentDir = path.join(params.stateDir, "agents", agentId, "agent");
await fs.mkdir(agentDir, { recursive: true });
for (const provider of providers) {
const profileId = buildQaMockProfileId(provider);
upsertAuthProfile({
profileId,
credential: {
type: "api_key",
provider,
key: "qa-mock-not-a-real-key",
displayName: `QA mock ${provider} credential`,
},
agentDir,
});
}
}
for (const provider of providers) {
next = applyAuthProfileConfig(next, {
profileId: buildQaMockProfileId(provider),
provider,
mode: "api_key",
displayName: `QA mock ${provider} credential`,
});
}
return next;
}
function isRetryableGatewayCallError(details: string): boolean {
return (
details.includes("handshake timeout") ||
@@ -440,8 +512,10 @@ export const __testing = {
preserveQaGatewayDebugArtifacts,
redactQaGatewayDebugText,
readQaLiveProviderConfigOverrides,
resolveQaGatewayChildProviderMode,
resolveQaLiveAnthropicSetupToken,
stageQaLiveAnthropicSetupToken,
stageQaMockAuthProfiles,
resolveQaLiveCliAuthEnv,
resolveQaOwnerPluginIdsForProviderIds,
resolveQaBundledPluginsSourceRoot,
@@ -868,8 +942,9 @@ export async function startQaGatewayChild(params: {
fs.mkdir(xdgDataHome, { recursive: true }),
fs.mkdir(xdgCacheHome, { recursive: true }),
]);
const providerMode = resolveQaGatewayChildProviderMode(params.providerMode);
const liveProviderIds =
providerMode === "live-frontier"
? [params.primaryModel, params.alternateModel]
.map((modelRef) =>
typeof modelRef === "string" ? splitQaModelRef(modelRef)?.provider : undefined,
@@ -902,7 +977,7 @@ export async function startQaGatewayChild(params: {
controlUiEnabled: params.controlUiEnabled,
}),
controlUiAllowedOrigins: params.controlUiAllowedOrigins,
providerMode,
primaryModel: params.primaryModel,
alternateModel: params.alternateModel,
enabledPluginIds,
@@ -921,6 +996,12 @@ export async function startQaGatewayChild(params: {
cfg,
stateDir,
});
if (providerMode === "mock-openai") {
cfg = await stageQaMockAuthProfiles({
cfg,
stateDir,
});
}
return params.mutateConfig ? params.mutateConfig(cfg) : cfg;
};
const stdout: Buffer[] = [];
@@ -981,7 +1062,7 @@ export async function startQaGatewayChild(params: {
xdgCacheHome,
bundledPluginsDir,
compatibilityHostVersion: runtimeHostVersion,
providerMode,
forwardHostHomeForClaudeCli: liveProviderIds.includes("claude-cli"),
claudeCliAuthMode: params.claudeCliAuthMode,
});

File diff suppressed because it is too large


@@ -22,6 +22,58 @@ type StreamEvent =
};
};
/**
* Provider variant tag for `body.model`. The mock previously ignored
* `body.model` for dispatch and only echoed it in the prose output, which
* made the parity gate tautological when run against the mock alone
* (both providers produced identical scenario plans by construction).
* Tagging requests with a normalized variant lets individual scenario
* branches opt into provider-specific behavior while the rest of the
* dispatcher stays shared, and lets `/debug/requests` consumers verify
* which provider lane a given request came from without re-parsing the
* raw model string.
*
* Policy:
* - `openai/*`, `openai-codex/*`, and bare `gpt-*` / `o1-*` / `openai-*`
*   model names → `"openai"`
* - `anthropic/*`, `claude-cli/*`, and bare `claude-*` / `anthropic-*`
*   model names → `"anthropic"`
* - Everything else (including empty strings) → `"unknown"`
*
* The `/v1/messages` route always feeds `body.model` straight through,
* so an Anthropic request with an `openai/gpt-5.4` model string is still
* classified as `"openai"`. That matches the parity program's convention
* where the provider label is the source of truth, not the HTTP route.
*/
export type MockOpenAiProviderVariant = "openai" | "anthropic" | "unknown";
export function resolveProviderVariant(model: string | undefined): MockOpenAiProviderVariant {
if (typeof model !== "string") {
return "unknown";
}
const trimmed = model.trim().toLowerCase();
if (trimmed.length === 0) {
return "unknown";
}
// Prefer the explicit `provider/model` or `provider:model` prefix when
// the caller supplied one — that's the most reliable signal.
const separatorMatch = /^([^/:]+)[/:]/.exec(trimmed);
const provider = separatorMatch?.[1] ?? trimmed;
if (provider === "openai" || provider === "openai-codex") {
return "openai";
}
if (provider === "anthropic" || provider === "claude-cli") {
return "anthropic";
}
// Fall back to model-name prefix matching for bare model strings like
// `gpt-5.4` or `claude-opus-4-6`.
if (/^(?:gpt-|o1-|openai-)/.test(trimmed)) {
return "openai";
}
if (/^(?:claude-|anthropic-)/.test(trimmed)) {
return "anthropic";
}
return "unknown";
}
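For reference, the classification policy can be exercised standalone. This sketch inlines the classifier above (rather than importing it, since the module path varies by checkout) and walks the documented cases:

```typescript
// Inlined copy of resolveProviderVariant from the mock server above,
// reproduced here so the check is self-contained.
type ProviderVariant = "openai" | "anthropic" | "unknown";

function classify(model: string | undefined): ProviderVariant {
  if (typeof model !== "string") return "unknown";
  const trimmed = model.trim().toLowerCase();
  if (trimmed.length === 0) return "unknown";
  // Explicit `provider/model` or `provider:model` prefix wins.
  const separatorMatch = /^([^/:]+)[/:]/.exec(trimmed);
  const provider = separatorMatch?.[1] ?? trimmed;
  if (provider === "openai" || provider === "openai-codex") return "openai";
  if (provider === "anthropic" || provider === "claude-cli") return "anthropic";
  // Bare model names fall back to prefix matching.
  if (/^(?:gpt-|o1-|openai-)/.test(trimmed)) return "openai";
  if (/^(?:claude-|anthropic-)/.test(trimmed)) return "anthropic";
  return "unknown";
}

console.log(classify("openai/gpt-5.4"));            // "openai"
console.log(classify("anthropic/claude-opus-4-6")); // "anthropic"
console.log(classify("gpt-5.4"));                   // "openai"
console.log(classify("claude-opus-4-6"));           // "anthropic"
console.log(classify("mistral/large"));             // "unknown"
console.log(classify(""));                          // "unknown"
```

Note that `mistral/large` stays `"unknown"` even though it has a provider prefix: only provider names the parity lanes care about are mapped.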
type MockOpenAiRequestSnapshot = {
raw: string;
body: Record<string, unknown>;
@@ -30,13 +82,52 @@ type MockOpenAiRequestSnapshot = {
instructions?: string;
toolOutput: string;
model: string;
providerVariant: MockOpenAiProviderVariant;
imageInputCount: number;
plannedToolName?: string;
};
// Anthropic /v1/messages request/response shapes the mock actually needs.
// This is a subset of the real Anthropic Messages API — just enough so the
// QA suite can run its parity pack against a "baseline" Anthropic provider
// without needing real API keys. The scenarios drive their dispatch through
// the shared mock scenario logic (buildResponsesPayload), so whatever
// behavior the OpenAI mock exposes is automatically mirrored on this route.
type AnthropicMessageContentBlock =
| { type: "text"; text: string }
| {
type: "tool_use";
id: string;
name: string;
input: Record<string, unknown>;
}
| {
type: "tool_result";
tool_use_id: string;
content: string | Array<{ type: "text"; text: string }>;
}
| { type: "image"; source: Record<string, unknown> };
type AnthropicMessage = {
role: "user" | "assistant";
content: string | AnthropicMessageContentBlock[];
};
type AnthropicMessagesRequest = {
model?: string;
max_tokens?: number;
system?: string | Array<{ type: "text"; text: string }>;
messages?: AnthropicMessage[];
tools?: Array<Record<string, unknown>>;
stream?: boolean;
};
const TINY_PNG_BASE64 =
"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8/x8AAwMCAO7Z0nQAAAAASUVORK5CYII=";
type MockScenarioState = {
subagentFanoutPhase: number;
};
function readBody(req: IncomingMessage): Promise<string> {
return new Promise((resolve, reject) => {
@@ -68,6 +159,23 @@ function writeSse(res: ServerResponse, events: StreamEvent[]) {
res.end(body);
}
type AnthropicStreamEvent = Record<string, unknown> & {
type: string;
};
function writeAnthropicSse(res: ServerResponse, events: AnthropicStreamEvent[]) {
const body = events
.map((event) => `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`)
.join("");
res.writeHead(200, {
"content-type": "text/event-stream",
"cache-control": "no-store",
connection: "keep-alive",
"content-length": Buffer.byteLength(body),
});
res.end(body);
}
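Unlike the OpenAI route's `writeSse`, the Anthropic SSE framing above prefixes each event with an `event:` line naming `event.type` before the `data:` JSON line. A minimal sketch of that framing, mirroring the body construction in `writeAnthropicSse`:

```typescript
// Mirror of the body construction in writeAnthropicSse above: each event
// gets an `event:` line naming its type, then a `data:` JSON line, then a
// blank line terminating the SSE frame.
const events = [
  { type: "message_start", message: { id: "msg_mock_1" } },
  { type: "message_stop" },
];
const body = events
  .map((event) => `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`)
  .join("");
console.log(body.split("\n")[0]); // "event: message_start"
```

Buffering the whole body first also lets the mock send an exact `content-length`, which some HTTP clients handle more predictably than chunked streaming in tests.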
function countApproxTokens(text: string) {
const trimmed = text.trim();
if (!trimmed) {
@@ -376,11 +484,11 @@ function extractLastCapture(text: string, pattern: RegExp) {
}
function extractExactReplyDirective(text: string) {
const backtickedMatch = extractLastCapture(text, /reply(?: with)? exactly\s+`([^`]+)`/i);
if (backtickedMatch) {
return backtickedMatch;
}
return extractLastCapture(text, /reply(?: with)? exactly:\s*([^\n]+)/i);
}
function extractExactMarkerDirective(text: string) {
@@ -392,10 +500,18 @@ function extractExactMarkerDirective(text: string) {
}
function isHeartbeatPrompt(text: string) {
const trimmed = text.trim();
if (!trimmed || /remember this fact/i.test(trimmed)) {
return false;
}
return /(?:^|\n)Read HEARTBEAT\.md if it exists\b/i.test(trimmed);
}
function buildAssistantText(
input: ResponsesInputItem[],
body: Record<string, unknown>,
scenarioState: MockScenarioState,
) {
const prompt = extractLastUserText(input);
const toolOutput = extractToolOutput(input);
const toolJson = parseToolOutputJson(toolOutput);
@@ -411,8 +527,10 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
: toolOutput;
const orbitCode = extractOrbitCode(memorySnippet);
const mediaPath = /MEDIA:([^\n]+)/.exec(toolOutput)?.[1]?.trim();
const exactReplyDirective =
extractExactReplyDirective(prompt) ?? extractExactReplyDirective(allInputText);
const exactMarkerDirective =
extractExactMarkerDirective(prompt) ?? extractExactMarkerDirective(allInputText);
const imageInputCount = countImageInputs(input);
const activeMemorySummary = extractActiveMemorySummary(allInputText);
const snackPreference = extractSnackPreference(activeMemorySummary ?? memorySnippet);
@@ -456,6 +574,23 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
if (/tool continuity check/i.test(prompt) && toolOutput) {
return `Protocol note: model switch handoff confirmed on ${model || "the requested model"}. QA mission from QA_KICKOFF_TASK.md still applies: understand this OpenClaw repo from source + docs before acting.`;
}
if (toolOutput && /repo contract followthrough check/i.test(prompt)) {
if (
/successfully (?:wrote|created|updated|replaced)/i.test(toolOutput) ||
/status:\s*complete/i.test(toolOutput)
) {
return [
"Read: AGENT.md, SOUL.md, FOLLOWTHROUGH_INPUT.md",
"Wrote: repo-contract-summary.txt",
"Status: complete",
].join("\n");
}
return [
"Read: AGENT.md, SOUL.md, FOLLOWTHROUGH_INPUT.md",
"Wrote: repo-contract-summary.txt",
"Status: blocked",
].join("\n");
}
if (/session memory ranking check/i.test(prompt) && orbitCode) {
return `Protocol note: I checked memory and the current Project Nebula codename is ${orbitCode}.`;
}
@@ -489,7 +624,11 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
if (/fanout worker beta/i.test(prompt)) {
return "BETA-OK";
}
if (
/subagent fanout synthesis check/i.test(prompt) &&
toolOutput &&
scenarioState.subagentFanoutPhase >= 2
) {
return "Protocol note: delegated fanout complete. Alpha=ALPHA-OK. Beta=BETA-OK.";
}
if (toolOutput && (/\bdelegate\b/i.test(prompt) || /subagent handoff/i.test(prompt))) {
@@ -579,7 +718,10 @@ function buildAssistantEvents(text: string): StreamEvent[] {
];
}
async function buildResponsesPayload(
body: Record<string, unknown>,
scenarioState: MockScenarioState,
) {
const input = Array.isArray(body.input) ? (body.input as ResponsesInputItem[]) : [];
const prompt = extractLastUserText(input);
const toolOutput = extractToolOutput(input);
@@ -587,6 +729,9 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
const allInputText = extractAllRequestTexts(input, body);
const isGroupChat = allInputText.includes('"is_group_chat": true');
const isBaselineUnmentionedChannelChatter = /\bno bot ping here\b/i.test(prompt);
if (/remember this fact/i.test(prompt)) {
return buildAssistantEvents(buildAssistantText(input, body, scenarioState));
}
if (isHeartbeatPrompt(prompt)) {
return buildAssistantEvents("HEARTBEAT_OK");
}
@@ -756,16 +901,16 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
});
}
if (/subagent fanout synthesis check/i.test(prompt)) {
if (!toolOutput && scenarioState.subagentFanoutPhase === 0) {
scenarioState.subagentFanoutPhase = 1;
return buildToolCallEventsWithArgs("sessions_spawn", {
task: "Fanout worker alpha: inspect the QA workspace and finish with exactly ALPHA-OK.",
label: "qa-fanout-alpha",
thread: false,
});
}
if (toolOutput && scenarioState.subagentFanoutPhase === 1) {
scenarioState.subagentFanoutPhase = 2;
return buildToolCallEventsWithArgs("sessions_spawn", {
task: "Fanout worker beta: inspect the QA workspace and finish with exactly BETA-OK.",
label: "qa-fanout-beta",
@@ -776,6 +921,30 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
if (/tool continuity check/i.test(prompt) && !toolOutput) {
return buildToolCallEventsWithArgs("read", { path: "QA_KICKOFF_TASK.md" });
}
if (/repo contract followthrough check/i.test(prompt)) {
if (!toolOutput) {
return buildToolCallEventsWithArgs("read", { path: "AGENT.md" });
}
if (toolOutput.includes("# Repo contract")) {
return buildToolCallEventsWithArgs("read", { path: "SOUL.md" });
}
if (toolOutput.includes("# Execution style")) {
return buildToolCallEventsWithArgs("read", { path: "FOLLOWTHROUGH_INPUT.md" });
}
if (
toolOutput.includes("Mission: prove you followed the repo contract.") &&
toolOutput.includes("Evidence path: AGENT.md -> SOUL.md -> FOLLOWTHROUGH_INPUT.md")
) {
return buildToolCallEventsWithArgs("write", {
path: "repo-contract-summary.txt",
content: [
"Mission: prove you followed the repo contract.",
"Evidence: AGENT.md -> SOUL.md -> FOLLOWTHROUGH_INPUT.md",
"Status: complete",
].join("\n"),
});
}
}
if ((/\bdelegate\b/i.test(prompt) || /subagent handoff/i.test(prompt)) && !toolOutput) {
return buildToolCallEventsWithArgs("sessions_spawn", {
task: "Inspect the QA workspace and return one concise protocol note.",
@@ -807,12 +976,390 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
) {
await sleep(60_000);
}
return buildAssistantEvents(buildAssistantText(input, body, scenarioState));
}
// ---------------------------------------------------------------------------
// Anthropic /v1/messages adapter
// ---------------------------------------------------------------------------
//
// The QA parity gate needs two comparable scenario runs: one against the
// "candidate" (openai/gpt-5.4) and one against the "baseline"
// (anthropic/claude-opus-4-6). The OpenAI mock above already dispatches all
// the scenario prompt branches we care about. Rather than duplicating that
// machinery, the /v1/messages route below translates Anthropic request
// shapes into the shared ResponsesInputItem[] format, calls the same
// buildResponsesPayload() dispatcher, and then re-serializes the resulting
// events into an Anthropic response. This gives the parity harness a
// baseline lane that exercises the same scenario logic without requiring
// real Anthropic API keys.
//
// Scope: handles Anthropic Messages requests with text and tool_result
// content blocks, supporting both non-streaming JSON responses and the
// streaming SSE path used by the parity harness.
function normalizeAnthropicSystemToString(
system: AnthropicMessagesRequest["system"],
): string | undefined {
if (typeof system === "string") {
return system.trim() || undefined;
}
if (Array.isArray(system)) {
const joined = system
.map((block) => (block?.type === "text" ? block.text : ""))
.filter(Boolean)
.join("\n")
.trim();
return joined || undefined;
}
return undefined;
}
function stringifyToolResultContent(
content: Extract<AnthropicMessageContentBlock, { type: "tool_result" }>["content"],
): string {
if (typeof content === "string") {
return content;
}
if (Array.isArray(content)) {
return content
.map((block) => (block?.type === "text" ? block.text : ""))
.filter(Boolean)
.join("\n");
}
return "";
}
function convertAnthropicMessagesToResponsesInput(params: {
system?: AnthropicMessagesRequest["system"];
messages: AnthropicMessage[];
}): ResponsesInputItem[] {
const items: ResponsesInputItem[] = [];
const systemText = normalizeAnthropicSystemToString(params.system);
if (systemText) {
items.push({
role: "system",
content: [{ type: "input_text", text: systemText }],
});
}
for (const message of params.messages) {
const content = message.content;
if (typeof content === "string") {
items.push({
role: message.role,
content: [
message.role === "assistant"
? { type: "output_text", text: content }
: { type: "input_text", text: content },
],
});
continue;
}
if (!Array.isArray(content)) {
continue;
}
// Buffer each block type so we can push in OpenAI-Responses order instead
// of the order they appear in the Anthropic content array. The parent
// role message must precede any function_call_output items from the same
// turn, otherwise extractToolOutput() (which scans for
// function_call_output AFTER the last user-role index) will not see the
// output and the downstream scenario dispatcher will behave as if no
// tool output was returned. Similarly, assistant tool_use blocks become
// function_call items that must follow the assistant text message they
// narrate.
const textPieces: Array<{ type: "input_text" | "output_text"; text: string }> = [];
const imagePieces: Array<{ type: "input_image"; image_url: string }> = [];
const toolResultItems: ResponsesInputItem[] = [];
const toolUseItems: ResponsesInputItem[] = [];
for (const block of content) {
if (!block || typeof block !== "object") {
continue;
}
if (block.type === "text") {
textPieces.push({
type: message.role === "assistant" ? "output_text" : "input_text",
text: block.text ?? "",
});
continue;
}
if (block.type === "image") {
// Mock only needs to count image inputs; a placeholder URL is fine.
imagePieces.push({ type: "input_image", image_url: "anthropic-mock:image" });
continue;
}
if (block.type === "tool_result") {
const output = stringifyToolResultContent(block.content);
if (output.trim()) {
toolResultItems.push({ type: "function_call_output", output });
}
continue;
}
if (block.type === "tool_use") {
// Mirror OpenAI's function_call output_item shape so downstream
// prompt extraction still sees "the assistant just emitted a tool
// call". The scenario dispatcher looks for tool_output on the next
// user turn, not the assistant's prior tool_use, so a minimal
// placeholder is enough.
toolUseItems.push({
type: "function_call",
name: block.name,
arguments: JSON.stringify(block.input ?? {}),
call_id: block.id,
});
continue;
}
}
if (textPieces.length > 0 || imagePieces.length > 0) {
const combinedContent: Array<Record<string, unknown>> = [...textPieces, ...imagePieces];
items.push({ role: message.role, content: combinedContent });
}
// Emit tool_use (assistant prior calls) and tool_result (user-side
// returns) AFTER the parent role message so extractLastUserText and
// extractToolOutput walk the array in the order they expect. For a
// tool_result-only user turn with no text/image blocks, the parent
// message is intentionally omitted — the function_call_output itself
// represents the user's "return the tool output" turn.
for (const toolUse of toolUseItems) {
items.push(toolUse);
}
for (const toolResult of toolResultItems) {
items.push(toolResult);
}
}
return items;
}
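The buffer-then-emit ordering above can be shown in miniature. This sketch uses simplified block and item shapes (not the real `ResponsesInputItem` / `AnthropicMessageContentBlock` types) to show why a user turn mixing text and `tool_result` blocks must emit the role message before the `function_call_output`:

```typescript
// Simplified shapes for illustration only.
type Block = { type: "text"; text: string } | { type: "tool_result"; content: string };
type Item =
  | { role: string; content: Array<{ type: string; text: string }> }
  | { type: "function_call_output"; output: string };

function convertUserTurn(blocks: Block[]): Item[] {
  const texts: Array<{ type: string; text: string }> = [];
  const toolResults: Item[] = [];
  for (const block of blocks) {
    if (block.type === "text") {
      texts.push({ type: "input_text", text: block.text });
    } else if (block.content.trim()) {
      toolResults.push({ type: "function_call_output", output: block.content });
    }
  }
  const items: Item[] = [];
  // Role message first, so a scanner looking for function_call_output
  // AFTER the last user-role index still sees the tool output.
  if (texts.length > 0) {
    items.push({ role: "user", content: texts });
  }
  return [...items, ...toolResults];
}

const out = convertUserTurn([
  { type: "tool_result", content: "ALPHA-OK" }, // arrives first in Anthropic order
  { type: "text", text: "here is the tool output" },
]);
// out[0] is the user message and out[1] the function_call_output, even
// though the tool_result block came first in the Anthropic content array.
```

If the converter preserved the Anthropic block order instead, the `function_call_output` would precede the user message and the downstream dispatcher would behave as if no tool output had been returned.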
type ExtractedAssistantOutput = {
text: string;
toolCalls: Array<{ id: string; name: string; input: Record<string, unknown> }>;
};
function extractFinalAssistantOutputFromEvents(events: StreamEvent[]): ExtractedAssistantOutput {
const toolCalls: ExtractedAssistantOutput["toolCalls"] = [];
let text = "";
for (const event of events) {
if (event.type !== "response.output_item.done") {
continue;
}
const item = event.item as {
type?: unknown;
name?: unknown;
call_id?: unknown;
id?: unknown;
arguments?: unknown;
content?: unknown;
};
if (item.type === "function_call" && typeof item.name === "string") {
let input: Record<string, unknown> = {};
if (typeof item.arguments === "string" && item.arguments.trim()) {
try {
const parsed = JSON.parse(item.arguments) as unknown;
if (parsed && typeof parsed === "object" && !Array.isArray(parsed)) {
input = parsed as Record<string, unknown>;
}
} catch {
// keep empty input on malformed args — mock dispatcher owns arg shape
}
}
toolCalls.push({
id: typeof item.call_id === "string" ? item.call_id : `toolu_mock_${toolCalls.length + 1}`,
name: item.name,
input,
});
continue;
}
if (item.type === "message" && Array.isArray(item.content)) {
for (const piece of item.content as Array<{ type?: unknown; text?: unknown }>) {
if (piece?.type === "output_text" && typeof piece.text === "string") {
text = piece.text;
}
}
}
}
return { text, toolCalls };
}
function buildAnthropicMessageResponse(params: {
model: string;
extracted: ExtractedAssistantOutput;
}): Record<string, unknown> {
const content: Array<Record<string, unknown>> = [];
if (params.extracted.text) {
content.push({ type: "text", text: params.extracted.text });
}
for (const call of params.extracted.toolCalls) {
content.push({
type: "tool_use",
id: call.id,
name: call.name,
input: call.input,
});
}
if (content.length === 0) {
content.push({ type: "text", text: "" });
}
const stopReason = params.extracted.toolCalls.length > 0 ? "tool_use" : "end_turn";
const approxInputTokens = 64;
const approxOutputTokens = Math.max(
16,
countApproxTokens(params.extracted.text) + params.extracted.toolCalls.length * 16,
);
return {
id: `msg_mock_${Math.floor(Math.random() * 1_000_000).toString(16)}`,
type: "message",
role: "assistant",
model: params.model || "claude-opus-4-6",
content,
stop_reason: stopReason,
stop_sequence: null,
usage: {
input_tokens: approxInputTokens,
output_tokens: approxOutputTokens,
},
};
}
function buildAnthropicMessageStreamEvents(params: {
model: string;
extracted: ExtractedAssistantOutput;
}): AnthropicStreamEvent[] {
const approxInputTokens = 64;
const approxOutputTokens = Math.max(
16,
countApproxTokens(params.extracted.text) + params.extracted.toolCalls.length * 16,
);
const messageId = `msg_mock_${Math.floor(Math.random() * 1_000_000).toString(16)}`;
const events: AnthropicStreamEvent[] = [
{
type: "message_start",
message: {
id: messageId,
type: "message",
role: "assistant",
model: params.model || "claude-opus-4-6",
content: [],
stop_reason: null,
stop_sequence: null,
usage: {
input_tokens: approxInputTokens,
output_tokens: 0,
},
},
},
];
let index = 0;
if (params.extracted.text || params.extracted.toolCalls.length === 0) {
events.push({
type: "content_block_start",
index,
content_block: {
type: "text",
text: "",
},
});
if (params.extracted.text) {
events.push({
type: "content_block_delta",
index,
delta: {
type: "text_delta",
text: params.extracted.text,
},
});
}
events.push({
type: "content_block_stop",
index,
});
index += 1;
}
for (const call of params.extracted.toolCalls) {
events.push({
type: "content_block_start",
index,
content_block: {
type: "tool_use",
id: call.id,
name: call.name,
input: {},
},
});
events.push({
type: "content_block_delta",
index,
delta: {
type: "input_json_delta",
partial_json: JSON.stringify(call.input ?? {}),
},
});
events.push({
type: "content_block_stop",
index,
});
index += 1;
}
events.push({
type: "message_delta",
delta: {
stop_reason: params.extracted.toolCalls.length > 0 ? "tool_use" : "end_turn",
},
usage: {
input_tokens: approxInputTokens,
output_tokens: approxOutputTokens,
},
});
events.push({
type: "message_stop",
});
return events;
}
async function buildMessagesPayload(
body: AnthropicMessagesRequest,
scenarioState: MockScenarioState,
): Promise<{
events: StreamEvent[];
input: ResponsesInputItem[];
extracted: ExtractedAssistantOutput;
responseBody: Record<string, unknown>;
streamEvents: AnthropicStreamEvent[];
model: string;
}> {
const messages = Array.isArray(body.messages) ? body.messages : [];
const input = convertAnthropicMessagesToResponsesInput({
system: body.system,
messages,
});
// Treat empty-string model the same as absent. A bare typeof check lets
// `""` leak through to `responseBody.model` and `lastRequest.model`,
// which then confuses parity consumers that assume the mock always
// echoes the real provider label. Normalize once and reuse everywhere.
const normalizedModel =
typeof body.model === "string" && body.model.trim() !== "" ? body.model : "claude-opus-4-6";
// Dispatch through the same scenario logic the /v1/responses route uses.
// The mock dispatcher only reads `body.input`, `body.model`, and
// `body.stream`, so a synthetic shim body is sufficient.
const dispatchBody: Record<string, unknown> = {
input,
model: normalizedModel,
stream: false,
};
const events = await buildResponsesPayload(dispatchBody, scenarioState);
const extracted = extractFinalAssistantOutputFromEvents(events);
const responseBody = buildAnthropicMessageResponse({
model: normalizedModel,
extracted,
});
const streamEvents = buildAnthropicMessageStreamEvents({
model: normalizedModel,
extracted,
});
return { events, input, extracted, responseBody, streamEvents, model: normalizedModel };
}
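The model-normalization rule inside `buildMessagesPayload` can be sketched on its own. This is a hedged restatement of the ternary above, not an exported helper:

```typescript
// Same rule buildMessagesPayload applies to body.model: absent, non-string,
// or blank values all collapse to the mock's default baseline model instead
// of leaking "" into responseBody.model and lastRequest.model.
const DEFAULT_BASELINE_MODEL = "claude-opus-4-6";

function normalizeModel(model: unknown): string {
  return typeof model === "string" && model.trim() !== "" ? model : DEFAULT_BASELINE_MODEL;
}

console.log(normalizeModel("claude-sonnet-4-6")); // "claude-sonnet-4-6"
console.log(normalizeModel(""));                  // "claude-opus-4-6"
console.log(normalizeModel(undefined));           // "claude-opus-4-6"
```

A value that passes the check is returned untrimmed, so surrounding whitespace in a real model string is preserved.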
export async function startQaMockOpenAiServer(params?: { host?: string; port?: number }) {
const host = params?.host ?? "127.0.0.1";
const scenarioState: MockScenarioState = { subagentFanoutPhase: 0 };
let lastRequest: MockOpenAiRequestSnapshot | null = null;
const requests: MockOpenAiRequestSnapshot[] = [];
const imageGenerationRequests: Array<Record<string, unknown>> = [];
@@ -829,6 +1376,8 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
{ id: "gpt-5.4-alt", object: "model" },
{ id: "gpt-image-1", object: "model" },
{ id: "text-embedding-3-small", object: "model" },
{ id: "claude-opus-4-6", object: "model" },
{ id: "claude-sonnet-4-6", object: "model" },
],
});
return;
@@ -888,7 +1437,8 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
const raw = await readBody(req);
const body = raw ? (JSON.parse(raw) as Record<string, unknown>) : {};
const input = Array.isArray(body.input) ? (body.input as ResponsesInputItem[]) : [];
const events = await buildResponsesPayload(body, scenarioState);
const resolvedModel = typeof body.model === "string" ? body.model : "";
lastRequest = {
raw,
body,
@@ -896,7 +1446,8 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
allInputText: extractAllRequestTexts(input, body),
instructions: extractInstructionsText(body) || undefined,
toolOutput: extractToolOutput(input),
model: resolvedModel,
providerVariant: resolveProviderVariant(resolvedModel),
imageInputCount: countImageInputs(input),
plannedToolName: extractPlannedToolName(events),
};
@@ -916,6 +1467,56 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
writeSse(res, events);
return;
}
if (req.method === "POST" && url.pathname === "/v1/messages") {
const raw = await readBody(req);
let body: AnthropicMessagesRequest = {};
try {
body = raw ? (JSON.parse(raw) as AnthropicMessagesRequest) : {};
} catch {
writeJson(res, 400, {
type: "error",
error: {
type: "invalid_request_error",
message: "Malformed JSON body for Anthropic Messages request.",
},
});
return;
}
const {
events,
input,
responseBody,
streamEvents,
model: normalizedModel,
} = await buildMessagesPayload(body, scenarioState);
// Record the adapted request snapshot so /debug/requests gives the QA
// suite the same plannedToolName / allInputText / toolOutput signals
// on the Anthropic route that the OpenAI route already exposes. This
// is what lets a single parity run diff assertions across both lanes.
// Reuse the normalized model so an empty-string body.model no longer
// leaks through to `lastRequest.model`.
lastRequest = {
raw,
body: body as Record<string, unknown>,
prompt: extractLastUserText(input),
allInputText: extractAllInputTexts(input),
toolOutput: extractToolOutput(input),
model: normalizedModel,
providerVariant: resolveProviderVariant(normalizedModel),
imageInputCount: countImageInputs(input),
plannedToolName: extractPlannedToolName(events),
};
requests.push(lastRequest);
if (requests.length > 50) {
requests.splice(0, requests.length - 50);
}
if (body.stream === true) {
writeAnthropicSse(res, streamEvents);
return;
}
writeJson(res, 200, responseBody);
return;
}
writeJson(res, 404, { error: "not found" });
});


@@ -53,6 +53,11 @@ describe("buildQaGatewayConfig", () => {
expect(getPrimaryModel(cfg.agents?.defaults?.model)).toBe("mock-openai/gpt-5.4");
expect(cfg.models?.providers?.["mock-openai"]?.baseUrl).toBe("http://127.0.0.1:44080/v1");
expect(cfg.models?.providers?.["mock-openai"]?.request).toEqual({ allowPrivateNetwork: true });
expect(cfg.models?.providers?.openai?.baseUrl).toBe("http://127.0.0.1:44080/v1");
expect(cfg.models?.providers?.openai?.request).toEqual({ allowPrivateNetwork: true });
expect(cfg.models?.providers?.anthropic?.baseUrl).toBe("http://127.0.0.1:44080");
expect(cfg.models?.providers?.anthropic?.request).toEqual({ allowPrivateNetwork: true });
expect(cfg.plugins?.allow).toEqual(["memory-core", "qa-channel"]);
expect(cfg.plugins?.entries?.["memory-core"]).toEqual({ enabled: true });
expect(cfg.plugins?.entries?.["qa-channel"]).toEqual({ enabled: true });
@@ -66,6 +71,31 @@ describe("buildQaGatewayConfig", () => {
expect(cfg.messages?.groupChat?.mentionPatterns).toEqual(["\\b@?openclaw\\b"]);
});
it("maps provider-qualified openai and anthropic refs through the mock provider lane", () => {
const cfg = buildQaGatewayConfig({
bind: "loopback",
gatewayPort: 18789,
gatewayToken: "token",
providerBaseUrl: "http://127.0.0.1:44080/v1",
workspaceDir: "/tmp/qa-workspace",
providerMode: "mock-openai",
primaryModel: "openai/gpt-5.4",
alternateModel: "anthropic/claude-opus-4-6",
});
expect(getPrimaryModel(cfg.agents?.defaults?.model)).toBe("openai/gpt-5.4");
expect(cfg.models?.providers?.openai?.api).toBe("openai-responses");
expect(cfg.models?.providers?.openai?.request).toEqual({ allowPrivateNetwork: true });
expect(cfg.models?.providers?.openai?.models.map((model) => model.id)).toContain("gpt-5.4");
expect(cfg.models?.providers?.anthropic?.api).toBe("anthropic-messages");
expect(cfg.models?.providers?.anthropic?.baseUrl).toBe("http://127.0.0.1:44080");
expect(cfg.models?.providers?.anthropic?.request).toEqual({ allowPrivateNetwork: true });
expect(cfg.models?.providers?.anthropic?.models.map((model) => model.id)).toContain(
"claude-opus-4-6",
);
expect(cfg.plugins?.allow).toEqual(["memory-core"]);
});
it("can omit qa-channel for live transport gateway children", () => {
const cfg = buildQaGatewayConfig({
bind: "loopback",


@@ -45,6 +45,10 @@ export function normalizeQaThinkingLevel(input: unknown): QaThinkingLevel | unde
return undefined;
}
function trimTrailingApiV1(baseUrl: string) {
return baseUrl.replace(/\/v1\/?$/i, "");
}
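The helper only strips a trailing `/v1` segment (with or without a trailing slash), which is what lets the config builder derive the Anthropic base URL from the OpenAI-style one. A standalone check, inlining the one-line helper above:

```typescript
// Inlined copy of trimTrailingApiV1 above for a self-contained check.
function trimTrailingApiV1(baseUrl: string): string {
  return baseUrl.replace(/\/v1\/?$/i, "");
}

console.log(trimTrailingApiV1("http://127.0.0.1:44080/v1"));  // "http://127.0.0.1:44080"
console.log(trimTrailingApiV1("http://127.0.0.1:44080/v1/")); // "http://127.0.0.1:44080"
// Only a trailing /v1 is stripped; /v1 elsewhere in the path is untouched.
console.log(trimTrailingApiV1("http://host/v1/api"));         // "http://host/v1/api"
```

This matters because the mock's Anthropic route is mounted at `/v1/messages` on the server root, so the provider baseUrl must not already end in `/v1`.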
export function mergeQaControlUiAllowedOrigins(extraOrigins?: string[]) {
const normalizedExtra = (extraOrigins ?? [])
.map((origin) => origin.trim())
@@ -74,10 +78,14 @@ export function buildQaGatewayConfig(params: {
thinkingDefault?: QaThinkingLevel;
}): OpenClawConfig {
const mockProviderBaseUrl = params.providerBaseUrl ?? "http://127.0.0.1:44080/v1";
const mockAnthropicBaseUrl = trimTrailingApiV1(mockProviderBaseUrl);
const mockOpenAiProvider: ModelProviderConfig = {
baseUrl: mockProviderBaseUrl,
apiKey: "test",
api: "openai-responses",
request: {
allowPrivateNetwork: true,
},
models: [
{
id: "gpt-5.4",
@@ -126,6 +134,50 @@ export function buildQaGatewayConfig(params: {
},
],
};
const mockNamedOpenAiProvider: ModelProviderConfig = {
...mockOpenAiProvider,
models: mockOpenAiProvider.models.map((model) => ({ ...model })),
};
const mockAnthropicProvider: ModelProviderConfig = {
baseUrl: mockAnthropicBaseUrl,
apiKey: "test",
api: "anthropic-messages",
request: {
allowPrivateNetwork: true,
},
models: [
{
id: "claude-opus-4-6",
name: "claude-opus-4-6",
api: "anthropic-messages",
reasoning: false,
input: ["text", "image"],
cost: {
input: 0,
output: 0,
cacheRead: 0,
cacheWrite: 0,
},
contextWindow: 200_000,
maxTokens: 4096,
},
{
id: "claude-sonnet-4-6",
name: "claude-sonnet-4-6",
api: "anthropic-messages",
reasoning: false,
input: ["text", "image"],
cost: {
input: 0,
output: 0,
cacheRead: 0,
cacheWrite: 0,
},
contextWindow: 200_000,
maxTokens: 4096,
},
],
};
const providerMode = normalizeQaProviderMode(params.providerMode ?? "mock-openai");
const primaryModel = params.primaryModel ?? defaultQaModelForMode(providerMode);
const alternateModel =
@@ -273,6 +325,8 @@ export function buildQaGatewayConfig(params: {
mode: "replace",
providers: {
"mock-openai": mockOpenAiProvider,
openai: mockNamedOpenAiProvider,
anthropic: mockAnthropicProvider,
},
},
}

View File

@@ -118,6 +118,50 @@ describe("qa scenario catalog", () => {
);
});
it("keeps mock-only image debug assertions guarded in live-frontier runs", () => {
const scenario = readQaScenarioPack().scenarios.find(
(candidate) => candidate.id === "image-understanding-attachment",
);
const imageRequestAction = scenario?.execution.flow?.steps
.flatMap((step) => step.actions ?? [])
.find(
(
action,
): action is {
set: string;
value?: { expr?: string };
} =>
typeof action === "object" &&
action !== null &&
"set" in action &&
action.set === "imageRequest",
);
const imageRequestExpr = imageRequestAction?.value?.expr;
expect(imageRequestExpr).toContain("env.mock ?");
expect(imageRequestExpr).toContain("/debug/requests");
});
it("adds a repo-instruction followthrough scenario to the parity pack", () => {
const scenario = readQaScenarioById("instruction-followthrough-repo-contract");
const config = readQaScenarioExecutionConfig("instruction-followthrough-repo-contract") as
| {
workspaceFiles?: Record<string, string>;
prompt?: string;
expectedReplyAll?: string[];
}
| undefined;
expect(config?.workspaceFiles?.["AGENT.md"]).toContain("Step order:");
expect(config?.workspaceFiles?.["SOUL.md"]).toContain("action-first");
expect(config?.workspaceFiles?.["FOLLOWTHROUGH_INPUT.md"]).toContain(
"Mission: prove you followed the repo contract.",
);
expect(config?.prompt).toContain("Repo contract followthrough check.");
expect(config?.expectedReplyAll).toEqual(["read:", "wrote:", "status:"]);
expect(scenario.title).toBe("Instruction followthrough repo contract");
});
it("rejects malformed string matcher lists before running a flow", () => {
expect(() =>
validateQaScenarioExecutionConfig({

View File

@@ -0,0 +1,101 @@
import { describe, expect, it } from "vitest";
import { buildQaSuiteSummaryJson } from "./suite.js";
describe("buildQaSuiteSummaryJson", () => {
const baseParams = {
// Test scenarios include a `steps: []` field to match the real suite
// scenario-result shape so downstream consumers that rely on the shape
// (parity gate, report renderer) stay aligned.
scenarios: [
{ name: "Scenario A", status: "pass" as const, steps: [] },
{ name: "Scenario B", status: "fail" as const, details: "something broke", steps: [] },
],
startedAt: new Date("2026-04-11T00:00:00.000Z"),
finishedAt: new Date("2026-04-11T00:05:00.000Z"),
providerMode: "mock-openai" as const,
primaryModel: "openai/gpt-5.4",
alternateModel: "openai/gpt-5.4-alt",
fastMode: true,
concurrency: 2,
};
it("records provider/model/mode so parity gates can verify labels", () => {
const json = buildQaSuiteSummaryJson(baseParams);
expect(json.run).toMatchObject({
startedAt: "2026-04-11T00:00:00.000Z",
finishedAt: "2026-04-11T00:05:00.000Z",
providerMode: "mock-openai",
primaryModel: "openai/gpt-5.4",
primaryProvider: "openai",
primaryModelName: "gpt-5.4",
alternateModel: "openai/gpt-5.4-alt",
alternateProvider: "openai",
alternateModelName: "gpt-5.4-alt",
fastMode: true,
concurrency: 2,
scenarioIds: null,
});
});
it("includes scenarioIds in run metadata when provided", () => {
const scenarioIds = ["approval-turn-tool-followthrough", "subagent-handoff", "memory-recall"];
const json = buildQaSuiteSummaryJson({
...baseParams,
scenarioIds,
});
expect(json.run.scenarioIds).toEqual(scenarioIds);
});
it("treats an empty scenarioIds array as unspecified (no filter)", () => {
// A CLI path that omits --scenario passes an empty array to runQaSuite.
// The summary must encode that as null so downstream parity/report
// tooling doesn't interpret a full run as an explicit empty selection.
const json = buildQaSuiteSummaryJson({
...baseParams,
scenarioIds: [],
});
expect(json.run.scenarioIds).toBeNull();
});
it("records an Anthropic baseline lane cleanly for parity runs", () => {
const json = buildQaSuiteSummaryJson({
...baseParams,
primaryModel: "anthropic/claude-opus-4-6",
alternateModel: "anthropic/claude-sonnet-4-6",
});
expect(json.run).toMatchObject({
primaryModel: "anthropic/claude-opus-4-6",
primaryProvider: "anthropic",
primaryModelName: "claude-opus-4-6",
alternateModel: "anthropic/claude-sonnet-4-6",
alternateProvider: "anthropic",
alternateModelName: "claude-sonnet-4-6",
});
});
it("leaves split fields null when a model ref is malformed", () => {
const json = buildQaSuiteSummaryJson({
...baseParams,
primaryModel: "not-a-real-ref",
alternateModel: "",
});
expect(json.run).toMatchObject({
primaryModel: "not-a-real-ref",
primaryProvider: null,
primaryModelName: null,
alternateModel: "",
alternateProvider: null,
alternateModelName: null,
});
});
it("keeps scenarios and counts alongside the run metadata", () => {
const json = buildQaSuiteSummaryJson(baseParams);
expect(json.scenarios).toHaveLength(2);
expect(json.counts).toEqual({
total: 2,
passed: 1,
failed: 1,
});
});
});

View File

@@ -81,7 +81,7 @@ type QaSuiteStep = {
run: () => Promise<string | void>;
};
-type QaSuiteScenarioResult = {
export type QaSuiteScenarioResult = {
name: string;
status: "pass" | "fail";
steps: QaReportCheck[];
@@ -1365,17 +1365,105 @@ function createQaSuiteReportNotes(params: {
return params.transport.createReportNotes(params);
}
export type QaSuiteSummaryJsonParams = {
scenarios: QaSuiteScenarioResult[];
startedAt: Date;
finishedAt: Date;
providerMode: QaProviderMode;
primaryModel: string;
alternateModel: string;
fastMode: boolean;
concurrency: number;
scenarioIds?: readonly string[];
};
/**
* Strongly-typed shape of `qa-suite-summary.json`. The GPT-5.4 parity gate
* (agentic-parity-report.ts, #64441) and any future parity wrapper can
* import this type instead of re-declaring the shape, so changes to the
* summary schema propagate through to every consumer at type-check time.
*/
export type QaSuiteSummaryJson = {
scenarios: QaSuiteScenarioResult[];
counts: {
total: number;
passed: number;
failed: number;
};
run: {
startedAt: string;
finishedAt: string;
providerMode: QaProviderMode;
primaryModel: string;
primaryProvider: string | null;
primaryModelName: string | null;
alternateModel: string;
alternateProvider: string | null;
alternateModelName: string | null;
fastMode: boolean;
concurrency: number;
scenarioIds: string[] | null;
};
};
/**
* Pure-ish JSON builder for qa-suite-summary.json. Exported so the GPT-5.4
* parity gate (agentic-parity-report.ts, #64441) and any future parity
* runner can assert-and-trust the provider/model that produced a given
* summary instead of blindly accepting the caller's candidateLabel /
* baselineLabel. Without the `run` block, a maintainer who swaps candidate
* and baseline summary paths could silently produce a mislabeled verdict.
*
* `scenarioIds` is only recorded when the caller passed a non-empty array
* (an explicit scenario selection). A missing or empty array means "no
* filter, full lane-selected catalog", which the summary encodes as `null`
* so parity/report tooling doesn't mistake a full run for an explicit
* empty selection.
*/
export function buildQaSuiteSummaryJson(params: QaSuiteSummaryJsonParams): QaSuiteSummaryJson {
const primarySplit = splitModelRef(params.primaryModel);
const alternateSplit = splitModelRef(params.alternateModel);
return {
scenarios: params.scenarios,
counts: {
total: params.scenarios.length,
passed: params.scenarios.filter((scenario) => scenario.status === "pass").length,
failed: params.scenarios.filter((scenario) => scenario.status === "fail").length,
},
run: {
startedAt: params.startedAt.toISOString(),
finishedAt: params.finishedAt.toISOString(),
providerMode: params.providerMode,
primaryModel: params.primaryModel,
primaryProvider: primarySplit?.provider ?? null,
primaryModelName: primarySplit?.model ?? null,
alternateModel: params.alternateModel,
alternateProvider: alternateSplit?.provider ?? null,
alternateModelName: alternateSplit?.model ?? null,
fastMode: params.fastMode,
concurrency: params.concurrency,
scenarioIds:
params.scenarioIds && params.scenarioIds.length > 0 ? [...params.scenarioIds] : null,
},
};
}
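The provider/model split above leans on a `splitModelRef` helper that is not part of this diff. Below is a minimal sketch of the contract the builder and its tests assume; this is hypothetical illustration code, not the suite's real implementation.

```typescript
// Hypothetical sketch of the `splitModelRef` contract buildQaSuiteSummaryJson
// relies on; the real helper lives elsewhere in the suite. A well-formed ref
// like "openai/gpt-5.4" splits at the first slash; a ref with no provider or
// no model segment yields undefined, which the builder records as null.
function splitModelRef(ref: string): { provider: string; model: string } | undefined {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) {
    return undefined; // malformed: missing provider or model segment
  }
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}
```

Malformed refs such as `"not-a-real-ref"` or `""` fall through to `undefined`, so the run block ends up with `primaryProvider: null` / `primaryModelName: null`, which is the behavior the "leaves split fields null when a model ref is malformed" test exercises.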
async function writeQaSuiteArtifacts(params: {
outputDir: string;
startedAt: Date;
finishedAt: Date;
scenarios: QaSuiteScenarioResult[];
transport: QaTransportAdapter;
providerMode: "mock-openai" | "live-frontier";
// Reuse the canonical QaProviderMode union instead of re-declaring it
// inline. Loop 6 already unified `QaSuiteSummaryJsonParams.providerMode`
// on this type; keeping the writer in sync prevents drift when model-
// selection.ts adds a new provider mode.
providerMode: QaProviderMode;
primaryModel: string;
alternateModel: string;
fastMode: boolean;
concurrency: number;
scenarioIds?: readonly string[];
}) {
const report = renderQaMarkdownReport({
title: "OpenClaw QA Scenario Suite",
@@ -1395,18 +1483,7 @@ async function writeQaSuiteArtifacts(params: {
await fs.writeFile(reportPath, report, "utf8");
await fs.writeFile(
summaryPath,
-`${JSON.stringify(
-{
-scenarios: params.scenarios,
-counts: {
-total: params.scenarios.length,
-passed: params.scenarios.filter((scenario) => scenario.status === "pass").length,
-failed: params.scenarios.filter((scenario) => scenario.status === "fail").length,
-},
-},
-null,
-2,
-)}\n`,
`${JSON.stringify(buildQaSuiteSummaryJson(params), null, 2)}\n`,
"utf8",
);
return { report, reportPath, summaryPath };
@@ -1576,6 +1653,16 @@ export async function runQaSuite(params?: QaSuiteRunParams): Promise<QaSuiteResu
alternateModel,
fastMode,
concurrency,
// When the caller supplied an explicit non-empty --scenario filter,
// record the executed (post-selectQaSuiteScenarios-normalized) ids
// so the summary matches what actually ran. When the caller passed
// nothing or an empty array ("no filter, full lane catalog"),
// preserve the unfiltered = null semantic so the summary stays
// distinguishable from an explicit all-scenarios selection.
scenarioIds:
params?.scenarioIds && params.scenarioIds.length > 0
? selectedCatalogScenarios.map((scenario) => scenario.id)
: undefined,
});
lab.setLatestReport({
outputPath: reportPath,
@@ -1737,6 +1824,12 @@ export async function runQaSuite(params?: QaSuiteRunParams): Promise<QaSuiteResu
alternateModel,
fastMode,
concurrency,
// Same "filtered → executed list, unfiltered → null" convention as
// the concurrent-path writeQaSuiteArtifacts call above.
scenarioIds:
params?.scenarioIds && params.scenarioIds.length > 0
? selectedCatalogScenarios.map((scenario) => scenario.id)
: undefined,
});
const latestReport = {
outputPath: reportPath,

View File

@@ -151,6 +151,20 @@ steps:
ref: imageStartedAtMs
timeoutMs:
expr: liveTurnTimeoutMs(env, 45000)
# Tool-call assertion (criterion 2 of the parity completion
# gate in #64227): the restored `image_generate` capability
# must have actually fired as a real tool call. Without this
# assertion, a prose reply that just mentions a MEDIA path
# could satisfy the scenario, so strengthen it by requiring
# the mock to have recorded `plannedToolName: "image_generate"`
# against a post-restart request. The `!env.mock || ...`
# guard means this check only runs in mock mode (where
# `/debug/requests` is available); live-frontier runs skip
# it and still pass the rest of the scenario.
- assert:
expr: "!env.mock || [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].some((request) => String(request.allInputText ?? '').toLowerCase().includes('capability flip image check') && request.plannedToolName === 'image_generate')"
message:
expr: "`expected image_generate tool call during capability flip scenario, saw plannedToolNames=${JSON.stringify([...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => String(request.allInputText ?? '').toLowerCase().includes('capability flip image check')).map((request) => request.plannedToolName ?? null))}`"
finally:
- call: patchConfig
args:

View File

@@ -64,9 +64,26 @@ steps:
expr: "!missingColorGroup"
message:
expr: "`missing expected colors in image description: ${outbound.text}`"
# Image-processing assertion: verify the mock actually received an
# image on the scenario-unique prompt. This is as strong as a
# tool-call assertion for this scenario — unlike the
# `source-docs-discovery-report` / `subagent-handoff` /
# `config-restart-capability-flip` scenarios that rely on a real
# tool call to satisfy the parity criterion, image understanding
# is handled inside the provider's vision capability and does NOT
# emit a tool call the mock can record as `plannedToolName`. The
# `imageInputCount` field IS the tool-call evidence for vision
# scenarios: it proves the attachment reached the provider, which
# is the only thing an external harness can verify in mock mode.
# Match on the scenario-unique prompt substring so the assertion
# can't be accidentally satisfied by some other scenario's image
# request that happens to share a debug log with this one.
- set: imageRequest
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].find((request) => String(request.prompt ?? '').includes('Image understanding check')) : null"
- assert:
expr: "!env.mock || (((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.prompt ?? '').includes('Image understanding check'))?.imageInputCount ?? 0) >= 1)"
expr: "!env.mock || (imageRequest && (imageRequest.imageInputCount ?? 0) >= 1)"
message:
expr: "`expected at least one input image, got ${String((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.prompt ?? '').includes('Image understanding check'))?.imageInputCount ?? 0)}`"
expr: "`expected at least one input image on the Image understanding check request, got imageInputCount=${String(imageRequest?.imageInputCount ?? 0)}`"
detailsExpr: outbound.text
```

View File

@@ -0,0 +1,127 @@
# Instruction followthrough repo contract
```yaml qa-scenario
id: instruction-followthrough-repo-contract
title: Instruction followthrough repo contract
surface: repo-contract
objective: Verify the agent reads repo instruction files first, follows the required tool order, and completes the first feasible action instead of stopping at a plan.
successCriteria:
- Agent reads the seeded instruction files before writing the requested artifact.
- Agent writes the requested artifact in the same run instead of returning only a plan.
- Agent does not ask for permission before the first feasible action.
- Final reply makes the completed read/write sequence explicit.
docsRefs:
- docs/help/testing.md
- docs/channels/qa-channel.md
codeRefs:
- src/agents/system-prompt.ts
- src/agents/pi-embedded-runner/run/incomplete-turn.ts
- extensions/qa-lab/src/mock-openai-server.ts
execution:
kind: flow
summary: Verify the agent reads repo instructions first, then completes the first bounded followthrough task without stalling.
config:
workspaceFiles:
AGENT.md: |-
# Repo contract
Step order:
1. Read AGENT.md.
2. Read SOUL.md.
3. Read FOLLOWTHROUGH_INPUT.md.
4. Write ./repo-contract-summary.txt.
5. Reply with three labeled lines exactly once: Read, Wrote, Status.
Do not stop after planning.
Do not ask for permission before the first feasible action.
SOUL.md: |-
# Execution style
Stay brief, honest, and action-first.
If the next tool action is feasible, do it before replying.
FOLLOWTHROUGH_INPUT.md: |-
Mission: prove you followed the repo contract.
Evidence path: AGENT.md -> SOUL.md -> FOLLOWTHROUGH_INPUT.md -> repo-contract-summary.txt
prompt: |-
Repo contract followthrough check. Read AGENT.md, SOUL.md, and FOLLOWTHROUGH_INPUT.md first.
Then follow the repo contract exactly, write ./repo-contract-summary.txt, and reply with
three labeled lines: Read, Wrote, Status.
Do not stop after planning and do not ask for permission before the first feasible action.
expectedReplyAll:
- "read:"
- "wrote:"
- "status:"
forbiddenNeedles:
- need permission
- need your approval
- can you approve
- i would
- i can
- next i would
```
```yaml qa-flow
steps:
- name: follows repo instructions instead of stopping at a plan
actions:
- call: reset
- forEach:
items:
expr: "Object.entries(config.workspaceFiles ?? {})"
item: workspaceFile
actions:
- call: fs.writeFile
args:
- expr: "path.join(env.gateway.workspaceDir, String(workspaceFile[0]))"
- expr: "`${String(workspaceFile[1] ?? '').trimEnd()}\\n`"
- utf8
- set: artifactPath
value:
expr: "path.join(env.gateway.workspaceDir, 'repo-contract-summary.txt')"
- call: runAgentPrompt
args:
- ref: env
- sessionKey: agent:qa:repo-contract
message:
expr: config.prompt
timeoutMs:
expr: liveTurnTimeoutMs(env, 40000)
- call: waitForCondition
saveAs: artifact
args:
- lambda:
async: true
expr: "((await fs.readFile(artifactPath, 'utf8').catch(() => null))?.includes('Mission: prove you followed the repo contract.') ? await fs.readFile(artifactPath, 'utf8').catch(() => null) : undefined)"
- expr: liveTurnTimeoutMs(env, 30000)
- expr: "env.providerMode === 'mock-openai' ? 100 : 250"
- set: expectedReplyAll
value:
expr: config.expectedReplyAll.map(normalizeLowercaseStringOrEmpty)
- call: waitForCondition
saveAs: outbound
args:
- lambda:
expr: "state.getSnapshot().messages.filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && expectedReplyAll.every((needle) => normalizeLowercaseStringOrEmpty(candidate.text).includes(needle))).at(-1)"
- expr: liveTurnTimeoutMs(env, 30000)
- expr: "env.providerMode === 'mock-openai' ? 100 : 250"
- assert:
expr: "!config.forbiddenNeedles.some((needle) => normalizeLowercaseStringOrEmpty(outbound.text).includes(needle))"
message:
expr: "`repo contract followthrough bounced for permission or stalled: ${outbound.text}`"
- set: followthroughDebugRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => /repo contract followthrough check/i.test(String(request.allInputText ?? ''))) : []"
- assert:
expr: "!env.mock || followthroughDebugRequests.filter((request) => request.plannedToolName === 'read').length >= 3"
message:
expr: "`expected three read tool calls before write, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
- assert:
expr: "!env.mock || followthroughDebugRequests.some((request) => request.plannedToolName === 'write')"
message:
expr: "`expected write tool call during repo contract followthrough, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
- assert:
expr: "!env.mock || (() => { const readIndices = followthroughDebugRequests.map((r, i) => r.plannedToolName === 'read' ? i : -1).filter(i => i >= 0); const firstWrite = followthroughDebugRequests.findIndex((r) => r.plannedToolName === 'write'); return readIndices.length >= 3 && firstWrite >= 0 && readIndices[2] < firstWrite; })()"
message:
expr: "`expected all 3 reads before any write during repo contract followthrough, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
detailsExpr: outbound.text
```
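The read-before-write ordering assertion in the flow above is inlined as an IIFE inside an expression string. Pulled out as a plain function (a sketch for illustration, not part of the scenario DSL), the check reads:

```typescript
// Sketch of the ordering check the repo-contract flow inlines as an IIFE:
// the third planned `read` must come before the first planned `write` in
// the mock's request order. Input is the per-request plannedToolName
// sequence, with null for requests that planned no tool.
function readsPrecedeWrite(plannedToolNames: Array<string | null>): boolean {
  const readIndices = plannedToolNames
    .map((name, index) => (name === "read" ? index : -1))
    .filter((index) => index >= 0);
  const firstWrite = plannedToolNames.indexOf("write");
  return readIndices.length >= 3 && firstWrite >= 0 && readIndices[2] < firstWrite;
}
```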

View File

@@ -1,5 +1,36 @@
# Memory recall after context switch
<!--
This scenario deliberately stays prose-only and does NOT gate on a
`/debug/requests` tool-call assertion, even though it is one of the
scenarios in the parity pack. The adversarial review in the umbrella
#64227 thread called this out as a coverage gap, but the underlying
behavior the scenario tests is legitimately prose-shaped: the agent is
supposed to pull a prior-turn fact ("ALPHA-7") back across an
intervening context switch and reply with the code. In a real
conversation, the model can do this EITHER by calling a memory-search
tool (which the qa-lab mock server doesn't currently expose) OR by
reading the fact directly from prior-turn context in its own
conversation window. Both strategies are valid parity behavior.
Forcing a `plannedToolName` assertion here would either require
extending the mock with a synthetic `memory_search` tool lane (PR O
scope, not PR J) or fabricating a tool-call requirement the real
providers never implement. Either path would make this scenario test
the harness, not the models. So we keep it prose-only, covered by the
`recallExpectedAny` / `rememberAckAny` assertions below, and flag the
exception explicitly rather than leaving it implicit.
Criterion 2 of the parity completion gate (no fake progress or fake
tool completion) is enforced for this scenario through the parity
report's failure-tone fake-success detector: a scenario marked `pass`
whose details text matches patterns like "timed out", "failed to",
"could not" gets flagged via `SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS`
in `extensions/qa-lab/src/agentic-parity-report.ts`. Positive-tone
detection was removed because it false-positives on legitimate passes
where the details field is the model's outbound prose.
-->
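The failure-tone detector described above can be sketched as follows. The pattern list is an assumption reconstructed from the examples the comment names; the real list lives in `extensions/qa-lab/src/agentic-parity-report.ts`.

```typescript
// Illustrative sketch of the fake-success detector, not the real
// agentic-parity-report.ts code. A scenario marked "pass" whose details
// text reads like a failure gets flagged for review; failed scenarios
// and passes with neutral details are left alone. The pattern list is
// an assumption based on the examples in the comment above.
const SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS: RegExp[] = [
  /timed out/i,
  /failed to/i,
  /could not/i,
];

function isSuspiciousPass(status: "pass" | "fail", details?: string): boolean {
  if (status !== "pass" || !details) return false;
  return SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS.some((pattern) => pattern.test(details));
}
```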
```yaml qa-scenario
id: memory-recall
title: Memory recall after context switch

View File

@@ -69,13 +69,22 @@ steps:
expr: hasModelSwitchContinuityEvidence(outbound.text)
message:
expr: "`switch reply missed kickoff continuity: ${outbound.text}`"
-- assert:
-expr: "!env.mock || (((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))?.plannedToolName) === 'read')"
-message:
-expr: "`expected read after switch, got ${String((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))?.plannedToolName ?? '')}`"
-- assert:
-expr: "!env.mock || (((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))?.model) === 'gpt-5.4-alt')"
-message:
-expr: "`expected alternate model, got ${String((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))?.model ?? '')}`"
- if:
expr: "Boolean(env.mock)"
then:
- set: switchDebugRequests
value:
expr: "await fetchJson(`${env.mock.baseUrl}/debug/requests`)"
- set: switchRequest
value:
expr: "switchDebugRequests.find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))"
- assert:
expr: "switchRequest?.plannedToolName === 'read'"
message:
expr: "`expected read after switch, got ${String(switchRequest?.plannedToolName ?? '')}`"
- assert:
expr: "String(switchRequest?.model ?? '') === String(alternate?.model ?? '')"
message:
expr: "`expected alternate model, got ${String(switchRequest?.model ?? '')}`"
detailsExpr: outbound.text
```

View File

@@ -56,5 +56,20 @@ steps:
expr: "!reportsDiscoveryScopeLeak(outbound.text)"
message:
expr: "`discovery report drifted beyond scope: ${outbound.text}`"
# Parity gate criterion 2 (no fake progress / fake tool completion):
# require an actual read tool call before the prose report. Without this,
# a model could fabricate a plausible Worked/Failed/Blocked/Follow-up
# report without ever touching the repo files the prompt names. The
# debug request log is fetched once and reused for both the assertion
# and its failure-message diagnostic. Each request's allInputText is
# lowercased inline at match time (the real prompt writes it as
# "Worked, Failed, Blocked") so the contains check is case-insensitive.
- set: discoveryDebugRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))] : []"
- assert:
expr: "!env.mock || discoveryDebugRequests.some((request) => String(request.allInputText ?? '').toLowerCase().includes('worked, failed, blocked') && request.plannedToolName === 'read')"
message:
expr: "`expected at least one read tool call during discovery report scenario, saw plannedToolNames=${JSON.stringify(discoveryDebugRequests.map((request) => request.plannedToolName ?? null))}`"
detailsExpr: outbound.text
```
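The assertion above follows the same shape as the other parity tool-call gates: fetch `/debug/requests` once, pin to a scenario-unique prompt substring, and require a specific `plannedToolName`. As a plain-function sketch (the request type here is an assumption limited to the two fields the scenario expressions actually read):

```typescript
// Sketch of the shared parity tool-call check. DebugRequest models only
// the fields the scenario expressions read from the mock server's
// /debug/requests log; the real entries carry more.
type DebugRequest = {
  allInputText?: string;
  plannedToolName?: string;
};

function sawToolCall(requests: DebugRequest[], promptNeedle: string, toolName: string): boolean {
  const needle = promptNeedle.toLowerCase();
  return requests.some(
    (request) =>
      String(request.allInputText ?? "").toLowerCase().includes(needle) &&
      request.plannedToolName === toolName,
  );
}
```

A prose-only reply shows up in the log as a request with no `plannedToolName`, so it cannot satisfy the check; matching on the scenario-unique needle keeps another scenario's tool call from satisfying it either.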

View File

@@ -113,6 +113,28 @@ steps:
expr: "sawAlpha && sawBeta"
message:
expr: "`fanout child sessions missing (alpha=${String(sawAlpha)} beta=${String(sawBeta)})`"
# Tool-call assertion (criterion 2 of the parity completion gate in
# #64227): the scenario must have actually invoked `sessions_spawn` at
# least twice, not just ended up with two rows in the session store
# through prose trickery. The session store alone can be populated by
# other flows or by a model that fabricates "delegation" narration;
# the sawAlpha/sawBeta check above already covers label distinctness.
# `plannedToolName` on the mock's `/debug/requests` log is the
# tool-call ground truth: two recorded sessions_spawn requests scoped
# to this scenario's prompt mean the model really dispatched both
# subagents. Guarded behind env.mock, like the other /debug/requests
# assertions, so live-frontier runs skip it.
- set: fanoutSpawnRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => request.plannedToolName === 'sessions_spawn' && /subagent fanout synthesis check/i.test(String(request.allInputText ?? ''))) : []"
- assert:
expr: "!env.mock || fanoutSpawnRequests.length >= 2"
message:
expr: "`expected at least two sessions_spawn tool calls during subagent fanout scenario, saw ${fanoutSpawnRequests.length}`"
- set: details
value:
expr: "outbound.text"

View File

@@ -46,5 +46,25 @@ steps:
expr: "!['failed to delegate','could not delegate','subagent unavailable'].some((needle) => normalizeLowercaseStringOrEmpty(outbound.text).includes(needle))"
message:
expr: "`subagent handoff reported failure: ${outbound.text}`"
# Parity gate criterion 2 (no fake progress / fake tool completion):
# require an actual sessions_spawn tool call. Without this, a model
# could produce the three labeled sections ("Delegated task", "Result",
# "Evidence") as free-form prose without ever delegating to a real
# subagent. The assertion is pinned to THIS scenario by matching the
# scenario-unique prompt substring "Delegate one bounded QA task"
# (not a broad /delegate|subagent/ regex) so the earlier
# subagent-fanout-synthesis scenario — which also contains "delegate"
# and produces its own pre-tool sessions_spawn request — cannot
# satisfy the assertion here. The match is also constrained to
# pre-tool requests (no toolOutput) because the mock only plans
# sessions_spawn on requests with no toolOutput; the follow-up
# request after the tool runs has plannedToolName unset.
- set: subagentDebugRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))] : []"
- assert:
expr: "!env.mock || subagentDebugRequests.some((request) => !request.toolOutput && /delegate one bounded qa task/i.test(String(request.allInputText ?? '')) && request.plannedToolName === 'sessions_spawn')"
message:
expr: "`expected sessions_spawn tool call during subagent handoff scenario, saw plannedToolNames=${JSON.stringify(subagentDebugRequests.map((request) => request.plannedToolName ?? null))}`"
detailsExpr: outbound.text
```

View File

@@ -1 +1 @@
-b92daceecab88cdb1ceeab30a7321399850a1fd13773af22dbb2035d39cdd5f8
1d087c0991987824d78c8ac4ec2c0e66d661f4bd4afd12b193d66634c69d75a0