mirror of
https://github.com/openclaw/openclaw.git
synced 2026-04-18 15:23:23 +02:00
qa: salvage GPT-5.4 parity proof slice (#65664)
* test(qa): gate parity prose scenarios on real tool calls

Closes criterion 2 of the GPT-5.4 parity completion gate in #64227 ('no fake progress / fake tool completion') for the two first/second-wave parity scenarios that can currently pass with a prose-only reply.

Background: the scenario framework already exposes tool-call assertions via /debug/requests on the mock server (see approval-turn-tool-followthrough for the pattern). Most parity scenarios use this seam to require a specific plannedToolName, but source-docs-discovery-report and subagent-handoff only checked the assistant's prose text, which means a model could fabricate:

- a Worked / Failed / Blocked / Follow-up report without ever calling the read tool on the docs / source files the prompt named
- three labeled 'Delegated task', 'Result', 'Evidence' sections without ever calling sessions_spawn to delegate

Both gaps are fake-progress loopholes for the parity gate.

Changes:

- source-docs-discovery-report: require at least one read tool call tied to the 'worked, failed, blocked' prompt in /debug/requests. The failure message dumps the observed plannedToolName list for debugging.
- subagent-handoff: require at least one sessions_spawn tool call tied to the 'delegate' / 'subagent handoff' prompt in /debug/requests, with the same debug-friendly failure message.

Both assertions are gated behind !env.mock so they no-op in live-frontier mode, where the real provider exposes plannedToolName through a different channel (or not at all).

Not touched: memory-recall is also in the parity pack, but its pass path is legitimately 'read the fact from prior-turn context'. That is a valid recall strategy, not fake progress, so it is out of scope for this PR. memory-recall's fake-progress story (no real memory_search call) would require bigger mock-server changes and belongs in a follow-up that extends the mock memory pipeline.
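As a sketch, the mock-mode discovery assertion looks roughly like the following (the flat `DebugRequest` shape and the helper name are illustrative assumptions, not the real scenario-framework API):

```typescript
// Simplified shape of one /debug/requests snapshot entry; the real mock
// server records more fields. Both field names are assumptions here.
interface DebugRequest {
  plannedToolName?: string;
  allInputText: string;
}

// Mock-mode assertion sketch: at least one request tied to the discovery
// prompt must have planned the read tool; otherwise fail with the observed
// plannedToolName list so the loophole is debuggable.
function assertReadToolCalled(requests: DebugRequest[]): void {
  const matched = requests.some(
    (request) =>
      request.allInputText.toLowerCase().includes("worked, failed, blocked") &&
      request.plannedToolName === "read",
  );
  if (!matched) {
    const observed = requests.map((r) => r.plannedToolName ?? "(none)").join(", ");
    throw new Error(`expected a read tool call for the discovery prompt; observed: ${observed}`);
  }
}
```

In live-frontier mode the scenario would skip this check entirely, mirroring the mock-only gating described above.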
Validation:

- pnpm test extensions/qa-lab/src/scenario-catalog.test.ts

Refs #64227

* test(qa): fix case-sensitive tool-call assertions and dedupe debug fetch

Addresses loop-6 review feedback on PR #64681:

1. Copilot / Greptile / codex-connector all flagged that the discovery scenario's .includes('worked, failed, blocked') assertion is case-sensitive, but the real prompt says 'Worked, Failed, Blocked...', so the mock-mode assertion never matches. Fix: lowercase-normalize allInputText before the contains check.
2. Greptile P2: the expr and message.expr each called fetchJson separately, incurring two round-trips to /debug/requests. Fix: hoist the fetch to a set step (discoveryDebugRequests / subagentDebugRequests) and reuse the snapshot.
3. Copilot: the subagent-handoff assertion scanned the entire request log and matched the first request with 'delegate' in its input text, which could false-pass on a stale prior scenario. Fix: reverse the array and take the most recent matching request instead.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass).

Refs #64227

* test(qa): narrow subagent-handoff tool-call assertion to pre-tool requests

Pass-2 codex-connector P1 finding on #64681: the reverse-find pattern from pass 1 usually lands on the FOLLOW-UP request after the mock runs sessions_spawn, not the pre-tool planning request that actually has plannedToolName === 'sessions_spawn'. The mock only plans that tool on requests with !toolOutput (mock-openai-server.ts:662), so the post-tool request has plannedToolName unset and the assertion fails even when the handoff succeeded.

Fix: switch the assertion back to a forward .some() match but add a !request.toolOutput filter so the match is pinned to the pre-tool planning phase. The case-insensitive regex, the fetchJson dedupe, and the failure-message diagnostic from pass 1 are unchanged.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass).
Refs #64227

* test(qa): pin subagent-handoff tool-call assertion to scenario prompt

Addresses the pass-3 codex-connector P1 on #64681: the pass-2 fix filtered to pre-tool requests but still used a broad `/delegate|subagent handoff/i` regex. The `subagent-fanout-synthesis` scenario runs BEFORE `subagent-handoff` in catalog order (scenarios are sorted by path), and the fanout prompt reads 'Subagent fanout synthesis check: delegate exactly two bounded subagents sequentially', which contains 'delegate' and also plans sessions_spawn pre-tool. That produces a cross-scenario false pass where the fanout's earlier sessions_spawn request satisfies the handoff assertion even when the handoff run never delegates.

Fix: tighten the input-text match from `/delegate|subagent handoff/i` to `/delegate one bounded qa task/i`, which is the exact scenario-unique substring from the `subagent-handoff` config.prompt. That pins the assertion to this scenario's request window and closes the cross-scenario false positive.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass).

Refs #64227

* test(qa): align parity assertion comments with actual filter logic

Addresses two loop-7 Copilot findings on PR #64681:

1. source-docs-discovery-report.md: the explanatory comment said the debug request log was 'lowercased for case-insensitive matching', but the code actually lowercases each request's allInputText inline inside the .some() predicate, not the discoveryDebugRequests snapshot. Rewrite the comment to describe the inline-lowercase pattern so a future reader matches the code they see.
2. subagent-handoff.md: the comment said the assertion 'must be pinned to THIS scenario's request window', but the implementation actually relies on matching a scenario-unique prompt substring (/delegate one bounded qa task/i), not a request window. Rewrite the comment to describe the substring pinning and keep the pre-tool filter rationale intact.
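Put together, the pass-2 plus pass-3 assertion predicate can be sketched like this (the request shape and helper name are illustrative; only the regex literal and the `sessions_spawn` / `!toolOutput` conditions come from the changes above):

```typescript
// Illustrative /debug/requests entry; toolOutput is set only on follow-up
// requests that carry a tool result back to the model.
interface DebugRequest {
  plannedToolName?: string;
  allInputText: string;
  toolOutput?: string;
}

// Scenario-unique substring from the subagent-handoff prompt: pins the match
// to this scenario's requests so the earlier fanout scenario cannot satisfy it.
const HANDOFF_PROMPT = /delegate one bounded qa task/i;

function subagentHandoffDelegated(requests: DebugRequest[]): boolean {
  return requests.some(
    (request) =>
      // Pre-tool planning phase only: the mock sets plannedToolName here.
      !request.toolOutput &&
      HANDOFF_PROMPT.test(request.allInputText) &&
      request.plannedToolName === "sessions_spawn",
  );
}
```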
No runtime change; a comment-only fix to keep reviewer expectations aligned with the actual assertion shape.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass).

Refs #64227

* test(qa): extend tool-call assertions to image-understanding, subagent-fanout, and capability-flip scenarios

* Guard mock-only image parity assertions

* Expand agentic parity second wave

* test(qa): pad parity suspicious-pass isolation to second wave

* qa-lab: parametrize parity report title and drop stale first-wave comment

Addresses two loop-7 Copilot findings on PR #64662:

1. Hard-coded 'GPT-5.4 / Opus 4.6' markdown H1: the renderer now uses a template string that interpolates candidateLabel and baselineLabel, so any parity run (not only gpt-5.4 vs opus 4.6) renders an accurate title in saved reports. Default CLI flags still produce openai/gpt-5.4 vs anthropic/claude-opus-4-6 as the baseline pair.
2. Stale 'declared first-wave parity scenarios' comment in scopeSummaryToParityPack: the parity pack is now the ten-scenario first-wave + second-wave set (PR D + PR E). Comment updated to drop the first-wave qualifier and name the full QA_AGENTIC_PARITY_SCENARIOS constant the scope is filtering against.

New regression: 'parametrizes the markdown header from the comparison labels' asserts that non-default labels (openai/gpt-5.4-alt vs openai/gpt-5.4) render in the H1.

Validation: pnpm test extensions/qa-lab/src/agentic-parity-report.test.ts (13/13 pass).
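The title parametrization can be sketched as a one-line template. The exact H1 wording here is an assumption; the point is only that both labels are interpolated rather than hard-coded:

```typescript
// Interpolate both run labels into the report H1 instead of hard-coding
// "GPT-5.4 / Opus 4.6"; any candidate/baseline pair renders accurately.
function renderParityReportTitle(candidateLabel: string, baselineLabel: string): string {
  return `# Agentic parity report: ${candidateLabel} vs ${baselineLabel}`;
}
```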
Refs #64227

* qa-lab: fail parity gate on required scenario failures regardless of baseline parity

* test(qa): update readable-report test to cover all 10 parity scenarios

* qa-lab: strengthen parity-report fake-success detector and verify run.primaryProvider labels

* Tighten parity label and scenario checks

* fix: tighten parity label provenance checks

* fix: scope parity tool-call metrics to tool lanes

* Fix parity report label and fake-success checks

* fix(qa): tighten parity report edge cases

* qa-lab: add Anthropic /v1/messages mock route for parity baseline

Closes the last local-runnability gap on criterion 5 of the GPT-5.4 parity completion gate in #64227 ('the parity gate shows GPT-5.4 matches or beats Opus 4.6 on the agreed metrics').

Background: the parity gate needs two comparable scenario runs, one against openai/gpt-5.4 and one against anthropic/claude-opus-4-6, so the aggregate metrics and verdict in PR D (#64441) can be computed. Today the qa-lab mock server only implements /v1/responses, so the baseline run against Claude Opus 4.6 requires a real Anthropic API key. That makes the gate impossible to prove end-to-end from a local worktree and means the CI story is always 'two real providers + quota + keys'.

This PR adds a /v1/messages Anthropic-compatible route to the existing mock OpenAI server.
The route is a thin adapter that:

- Parses Anthropic Messages API request shapes (system as a string or [{type: text, text}], messages with string or block content, and text, tool_result, tool_use, and image blocks)
- Translates them into the ResponsesInputItem[] shape the existing shared scenario dispatcher (buildResponsesPayload) already understands
- Calls the shared dispatcher so both the OpenAI and Anthropic lanes run through the exact same scenario prompt-matching logic (same subagent fanout state machine, same extractRememberedFact helper, same /debug/requests telemetry)
- Converts the resulting OpenAI-format events back into an Anthropic message response with text and tool_use content blocks and a correct stop_reason (tool_use vs end_turn)

Non-streaming only: the QA suite runner falls back to non-streaming mock mode, so real Anthropic SSE isn't necessary for the parity baseline.

Also adds claude-opus-4-6 and claude-sonnet-4-6 to /v1/models so baseline model-list probes from the suite runner resolve without extra config.

Tests added:

- advertises Anthropic claude-opus-4-6 baseline model on /v1/models
- dispatches an Anthropic /v1/messages read tool call for source discovery prompts (tool_use stop_reason, correct input path, /debug/requests records plannedToolName=read)
- dispatches Anthropic /v1/messages tool_result follow-ups through the shared scenario logic (subagent-handoff two-stage flow: tool_use, then tool_result, then a 'Delegated task / Evidence' prose summary)

Local validation:

- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (18/18 pass)
- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (47/47 pass)

Refs #64227

Unblocks #64441 (parity harness) and the forthcoming qa parity run wrapper by giving the baseline lane a local-only mock path.
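A minimal sketch of the translation step, assuming heavily simplified Anthropic and Responses shapes (the real adapter also handles system prompts, tool_use, tool_result, and image blocks, and the `ResponsesInputItem` shape here is an assumption):

```typescript
// Simplified Anthropic Messages API shapes: content is a string or text blocks.
type AnthropicContent = string | Array<{ type: "text"; text: string }>;
interface AnthropicMessage {
  role: "user" | "assistant";
  content: AnthropicContent;
}

// Assumed subset of the Responses-style input item the shared dispatcher eats.
interface ResponsesInputItem {
  type: "message";
  role: "user" | "assistant";
  content: Array<{ type: "input_text" | "output_text"; text: string }>;
}

function convertAnthropicMessages(messages: AnthropicMessage[]): ResponsesInputItem[] {
  return messages.map((message) => {
    // Normalize string content into a single text block first.
    const blocks =
      typeof message.content === "string"
        ? [{ type: "text" as const, text: message.content }]
        : message.content;
    return {
      type: "message",
      role: message.role,
      content: blocks.map((block) => ({
        // User turns become input_text; assistant turns become output_text.
        type: message.role === "user" ? ("input_text" as const) : ("output_text" as const),
        text: block.text,
      })),
    };
  });
}
```

With both lanes normalized into the same item list, a single dispatcher can serve OpenAI and Anthropic requests identically, which is the design choice the commit describes.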
* qa-lab: fix Anthropic tool_result ordering in messages adapter

Addresses the loop-6 Copilot / Greptile finding on PR #64685: in `convertAnthropicMessagesToResponsesInput`, `tool_result` blocks were pushed to `items` inside the per-block loop while the surrounding user/assistant message was only pushed after the loop finished. That reordered the function_call_output BEFORE its parent user message whenever a user turn mixed `tool_result` with fresh text/image blocks, which broke `extractToolOutput` (it scans AFTER the last user-role index; a function_call_output placed BEFORE that index is invisible to it) and made the downstream scenario dispatcher behave as if no tool output had been returned on mixed-content turns.

Fix: buffer `tool_result` and `tool_use` blocks in local arrays during the per-block loop, push the parent role message first (when it has any text/image pieces), then push the accumulated function_call / function_call_output items in their original order. tool_result-only user turns still omit the parent message as before, so the non-mixed subagent-fanout-synthesis two-stage flow that already worked keeps working.

Regression added:

- `places tool_result after the parent user message even in mixed-content turns`: sends a user turn that mixes a `tool_result` block with a trailing fresh text block, then inspects `/debug/last-request` to assert that `toolOutput === 'SUBAGENT-OK'` (extractToolOutput found the function_call_output AFTER the last user index) and `prompt === 'Keep going with the fanout.'` (extractLastUserText picked up the trailing fresh text).

Local validation: pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (19/19 pass).
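The buffering fix can be sketched on a stripped-down user turn (block and item shapes are illustrative; the real converter also carries tool_use, images, and call ids):

```typescript
// Minimal block and item shapes for one user turn.
interface Block {
  type: "text" | "tool_result";
  text?: string;
  output?: string;
}
type Item =
  | { type: "message"; role: "user"; text: string }
  | { type: "function_call_output"; output: string };

// Buffer tool_result blocks during the per-block loop and flush them AFTER
// the parent message, so a scanner that looks past the last user-role index
// (like extractToolOutput) still sees them on mixed-content turns.
function convertUserTurn(blocks: Block[]): Item[] {
  const items: Item[] = [];
  const toolOutputs: Item[] = [];
  const textPieces: string[] = [];
  for (const block of blocks) {
    if (block.type === "tool_result") {
      toolOutputs.push({ type: "function_call_output", output: block.output ?? "" });
    } else if (block.text) {
      textPieces.push(block.text);
    }
  }
  // Parent message first (only when the turn has fresh text), then the
  // buffered tool outputs in their original order. tool_result-only turns
  // still omit the parent message, preserving the pre-fix behavior.
  if (textPieces.length > 0) {
    items.push({ type: "message", role: "user", text: textPieces.join("\n") });
  }
  items.push(...toolOutputs);
  return items;
}
```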
Refs #64227

* qa-lab: reject Anthropic streaming and empty model in messages mock

* qa-lab: tag mock request snapshots with a provider variant so parity runs can diff per provider

* Handle invalid Anthropic mock JSON

* fix: wire mock parity providers by model ref

* fix(qa): support Anthropic message streaming in mock parity lane

* qa-lab: record provider/model/mode in qa-suite-summary.json

Closes the 'summary cannot be label-verified' half of criterion 5 on the GPT-5.4 parity completion gate in #64227.

Background: the parity gate in #64441 compares two qa-suite-summary.json files and trusts whatever candidateLabel / baselineLabel the caller passes. Today the summary JSON only contains { scenarios, counts }, so nothing in the summary records which provider/model the run actually used. If a maintainer swaps the candidate and baseline summary paths in a parity-report call, the verdict is silently mislabeled and nobody can retroactively verify which run produced which summary.

Changes:

- Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt, providerMode, primaryModel (+ provider and model splits), alternateModel (+ provider and model splits), fastMode, concurrency, and scenarioIds (when explicitly filtered).
- Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary JSON shape is unit-testable and the parity gate (and any future parity wrapper) can import the exact same type rather than reverse-engineering the JSON shape at runtime.
- Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so --scenario-ids flags are recorded in the summary.
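The new run block and the model-ref split can be sketched as follows. The field list comes from the changes above; exact optionality and the `splitModelRef` helper are assumptions:

```typescript
// Assumed shape of the summary's `run` block; the real exported type lives
// next to buildQaSuiteSummaryJson.
interface QaSuiteRunMetadata {
  startedAt: string;
  finishedAt: string;
  providerMode: "mock-openai" | "live-frontier";
  primaryModel: string;
  primaryProvider: string | null;
  alternateModel: string | null;
  fastMode: boolean;
  concurrency: number;
  scenarioIds: string[] | null;
}

// Split a "provider/model" ref into its halves; a malformed ref (no slash,
// or an empty half) leaves both splits null rather than guessing.
function splitModelRef(ref: string): { provider: string | null; model: string | null } {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) {
    return { provider: null, model: null };
  }
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}
```

The null-on-malformed behavior matches the 'leaves split fields null when a model ref is malformed' test case listed below it.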
Unit tests added (src/suite.summary-json.test.ts, 5 cases):

- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata

This is additive: existing callers of qa-suite-summary.json continue to see the same { scenarios, counts } shape, just with an extra run field. No existing consumers of the JSON need to change. The follow-up 'qa parity run' CLI wrapper (run the parity pack twice against candidate + baseline, emit two labeled summaries in one command) stacks cleanly on top of this change and will land as a separate PR once #64441 and #64662 merge, so the wrapper can call runQaParityReportCommand directly.

Local validation:

- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)

Refs #64227

Unblocks the final parity run for #64441 / #64662 by making summaries self-describing.

* qa-lab: strengthen qa-suite-summary builder types and empty-array semantics

Addresses 4 loop-6 Copilot / codex-connector findings on PR #64689 (re-opened as #64789):

1. P2 codex + Copilot: an empty `scenarioIds` array was serialized as `[]` because of a truthiness check. The CLI passes an empty array when --scenario is omitted, so full-suite runs would incorrectly record an explicit empty selection. Fix: switch to a `length > 0` check so `[]` and undefined both encode as `null` in the summary run metadata.
2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate consumers, but its return type was `Record<string, unknown>`, which defeated the point of exporting it. Fix: introduce a concrete `QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and make the builder return it.
Downstream code (the parity gate and the parity run wrapper) can now import the type and keep consumers type-checked.

3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the `'mock-openai' | 'live-frontier'` string union even though `QaProviderMode` is already imported from model-selection.ts. Fix: reuse `QaProviderMode` so provider-mode additions flow through both types at once.
4. Copilot: test fixtures omitted `steps` from the fake scenario results, creating shape drift with the real suite scenario-result shape. Fix: pad the test fixtures with `steps: []` and tighten the scenarioIds assertion to read `json.run.scenarioIds` directly (the new concrete return type makes the type cast unnecessary).

New regression: `treats an empty scenarioIds array as unspecified (no filter)` passes `scenarioIds: []` and asserts the summary records `scenarioIds: null`.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).

Refs #64227

* qa-lab: record executed scenarioIds in summary run metadata

Addresses the pass-3 codex-connector P2 on #64789 (replacement of #64689): `run.scenarioIds` was copied from the raw `params.scenarioIds` caller input, but `runQaSuite` normalizes that input through `selectQaSuiteScenarios`, which dedupes via `Set` and reorders the selection to catalog order. When callers repeat --scenario ids or pass them in non-catalog order, the summary metadata drifted from the scenarios actually executed, which can make parity/report tooling treat equivalent runs as different or trust inaccurate provenance.

Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass `selectedCatalogScenarios.map(scenario => scenario.id)` instead of `params?.scenarioIds`, so the summary records the post-selection executed list.
This also covers the full-suite case automatically (the executed list is the full lane-filtered catalog), giving parity consumers a stable record of exactly which scenarios landed in the run regardless of how the caller phrased the request. buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2 semantics are preserved, so the public helper still treats an empty array as 'unspecified' for any future caller that legitimately passes one.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).

Refs #64227

* qa-lab: preserve null scenarioIds for unfiltered suite runs

Addresses the pass-4 codex-connector P2 on #64789: the pass-3 fix always passed `selectedCatalogScenarios.map(...)` to writeQaSuiteArtifacts, which made unfiltered full-suite runs indistinguishable from an explicit all-scenarios selection in the summary metadata. The 'unfiltered → null' semantic (documented in the buildQaSuiteSummaryJson JSDoc and exercised by the "treats an empty scenarioIds array as unspecified" regression) was lost.

Fix: both writeQaSuiteArtifacts call sites now condition on the caller's original `params.scenarioIds`. When the caller passed an explicit non-empty filter, record the post-selection executed list (pass-3 behavior, preserving Set-dedupe + catalog-order normalization). When the caller passed undefined or an empty array, pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's length check serializes null (pass-2 behavior, preserving unfiltered semantics).

This keeps both codex-connector findings satisfied simultaneously:

- an explicit --scenario filter records the deduped, catalog-ordered executed list, not the raw caller input
- an unfiltered full-suite run records null, not a full catalog dump that would shadow 'explicit all-scenarios' selections

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).
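Combining the pass-2 through pass-4 semantics, the scenarioIds rule can be sketched as one pure helper (names are illustrative, not the real call-site code):

```typescript
interface CatalogScenario {
  id: string;
}

// requested: the caller's raw --scenario input; selected: the post-selection
// executed list (deduped, catalog order) from scenario selection.
function summaryScenarioIds(
  requested: string[] | undefined,
  selected: CatalogScenario[],
): string[] | null {
  // undefined or [] means "no explicit filter": serialize null so an
  // unfiltered run stays distinguishable from an explicit all-scenarios pick.
  if (!requested || requested.length === 0) {
    return null;
  }
  // Explicit filter: record what actually executed, not the raw input.
  return selected.map((scenario) => scenario.id);
}
```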
Refs #64227

* qa-lab: reuse QaProviderMode in writeQaSuiteArtifacts param type

* qa-lab: stage mock auth profiles so the parity gate runs without real credentials

* fix(qa): clean up mock auth staging follow-ups

* ci: add parity-gate workflow that runs the GPT-5.4 vs Opus 4.6 gate end-to-end against the qa-lab mock

* ci: use supported parity gate runner label

* ci: watch gateway changes in parity gate

* docs: pin parity runbook alternate models

* fix(ci): watch qa-channel parity inputs

* qa: roll up parity proof closeout

* qa: harden mock parity review fixes

* qa-lab: fix review findings (comment wording, placeholder key, exported type, ordering assertion) and remove false-positive positive-tone detection

* qa: fix memory-recall scenario count, update criterion 2 comment, cache fetchJson in model-switch

* qa-lab: clean up positive-tone comment + fix stale test expectations

* qa: pin workflow Node version to 22.14.0 + fix stale label-match wording

* qa-lab: refresh mock provider routing expectation

* docs: drop stale parity rollup rewrite from proof slice

* qa: run parity gate against mock lane

* deps: sync qa-lab lockfile

* build: refresh a2ui bundle hash

* ci: widen parity gate triggers

---------

Co-authored-by: Eva <eva@100yen.org>
.github/workflows/parity-gate.yml (vendored, Normal file, 93 lines added)
@@ -0,0 +1,93 @@
name: Parity gate

on:
  pull_request:
    types: [opened, reopened, synchronize, ready_for_review]
    paths:
      - "extensions/qa-lab/**"
      - "extensions/qa-channel/**"
      - "extensions/openai/**"
      - "qa/scenarios/**"
      - "src/agents/**"
      - "src/context-engine/**"
      - "src/gateway/**"
      - "src/media/**"
      - ".github/workflows/parity-gate.yml"

permissions:
  contents: read

concurrency:
  group: parity-gate-${{ github.event.pull_request.number || github.sha }}
  cancel-in-progress: true

jobs:
  parity-gate:
    name: Run the GPT-5.4 / Opus 4.6 parity gate against the qa-lab mock
    if: ${{ github.event.pull_request.draft != true }}
    runs-on: blacksmith-8vcpu-ubuntu-2404
    timeout-minutes: 20
    env:
      # Fence the gate off from any real provider credentials. The qa-lab
      # mock server + auth staging (PR N) should be enough to produce a
      # meaningful verdict without touching a real API. If any of these
      # leak into the job env, fail hard instead of silently running
      # against a live provider and burning real budget.
      OPENAI_API_KEY: ""
      ANTHROPIC_API_KEY: ""
      OPENCLAW_LIVE_OPENAI_KEY: ""
      OPENCLAW_LIVE_ANTHROPIC_KEY: ""
      OPENCLAW_LIVE_GEMINI_KEY: ""
      OPENCLAW_LIVE_SETUP_TOKEN_VALUE: ""
    steps:
      - name: Checkout PR
        uses: actions/checkout@v4

      - name: Install pnpm
        uses: pnpm/action-setup@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: "22.14.0"
          cache: "pnpm"

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: Run GPT-5.4 lane
        run: |
          pnpm openclaw qa suite \
            --provider-mode mock-openai \
            --parity-pack agentic \
            --model openai/gpt-5.4 \
            --alt-model openai/gpt-5.4-alt \
            --output-dir .artifacts/qa-e2e/gpt54

      - name: Run Opus 4.6 lane
        run: |
          pnpm openclaw qa suite \
            --provider-mode mock-openai \
            --parity-pack agentic \
            --model anthropic/claude-opus-4-6 \
            --alt-model anthropic/claude-sonnet-4-6 \
            --output-dir .artifacts/qa-e2e/opus46

      - name: Generate parity report
        run: |
          pnpm openclaw qa parity-report \
            --repo-root . \
            --candidate-summary .artifacts/qa-e2e/gpt54/qa-suite-summary.json \
            --baseline-summary .artifacts/qa-e2e/opus46/qa-suite-summary.json \
            --candidate-label openai/gpt-5.4 \
            --baseline-label anthropic/claude-opus-4-6 \
            --output-dir .artifacts/qa-e2e/parity

      - name: Upload parity artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: parity-gate-${{ github.event.pull_request.number || github.sha }}
          path: .artifacts/qa-e2e/
          retention-days: 14
          if-no-files-found: warn
@@ -2,16 +2,42 @@ import { describe, expect, it } from "vitest";
import {
  buildQaAgenticParityComparison,
  computeQaAgenticParityMetrics,
  QaParityLabelMismatchError,
  renderQaAgenticParityMarkdownReport,
  type QaParityReportScenario,
  type QaParitySuiteSummary,
} from "./agentic-parity-report.js";

const FULL_PARITY_PASS_SCENARIOS: QaParityReportScenario[] = [
  { name: "Approval turn tool followthrough", status: "pass" as const },
  { name: "Compaction retry after mutating tool", status: "pass" as const },
  { name: "Model switch with tool continuity", status: "pass" as const },
  { name: "Source and docs discovery report", status: "pass" as const },
  { name: "Image understanding from attachment", status: "pass" as const },
  { name: "Subagent handoff", status: "pass" as const },
  { name: "Subagent fanout synthesis", status: "pass" as const },
  { name: "Memory recall after context switch", status: "pass" as const },
  { name: "Thread memory isolation", status: "pass" as const },
  { name: "Config restart capability flip", status: "pass" as const },
  { name: "Instruction followthrough repo contract", status: "pass" as const },
];

function withScenarioOverride(name: string, override: Partial<QaParityReportScenario>) {
  return FULL_PARITY_PASS_SCENARIOS.map((scenario) =>
    scenario.name === name ? { ...scenario, ...override } : scenario,
  );
}

describe("qa agentic parity report", () => {
  it("computes first-wave parity metrics from suite summaries", () => {
    const summary: QaParitySuiteSummary = {
      scenarios: [
        { name: "Scenario A", status: "pass" },
        { name: "Scenario B", status: "fail", details: "incomplete turn detected" },
        { name: "Approval turn tool followthrough", status: "pass" },
        {
          name: "Compaction retry after mutating tool",
          status: "fail",
          details: "incomplete turn detected",
        },
      ],
    };
@@ -28,6 +54,23 @@ describe("qa agentic parity report", () => {
    });
  });

  it("keeps non-tool scenarios out of the valid-tool-call metric", () => {
    const summary: QaParitySuiteSummary = {
      scenarios: [
        { name: "Approval turn tool followthrough", status: "pass" },
        { name: "Memory recall after context switch", status: "pass" },
        { name: "Image understanding from attachment", status: "pass" },
      ],
    };

    expect(computeQaAgenticParityMetrics(summary)).toMatchObject({
      totalScenarios: 3,
      passedScenarios: 3,
      validToolCallCount: 1,
      validToolCallRate: 1,
    });
  });

  it("fails the parity gate when the candidate regresses against baseline", () => {
    const comparison = buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
@@ -207,33 +250,70 @@ describe("qa agentic parity report", () => {
    );
  });

  it("fails the parity gate when a required parity scenario fails on both sides", () => {
    // Regression for the loop-7 Codex-connector P1 finding: without this
    // check, a required parity scenario that fails on both candidate and
    // baseline still produces pass=true because the downstream metric
    // comparisons are purely relative (candidate vs baseline). Cover the
    // whole parity pack as pass on both sides except the one scenario we
    // deliberately fail on both sides, so the assertion can pin the
    // isolated gate failure under test.
    const scenariosWithBothFail = withScenarioOverride("Approval turn tool followthrough", {
      status: "fail",
    });
    const comparison = buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: { scenarios: scenariosWithBothFail },
      baselineSummary: { scenarios: scenariosWithBothFail },
      comparedAt: "2026-04-11T00:00:00.000Z",
    });

    expect(comparison.pass).toBe(false);
    expect(comparison.failures).toContain(
      "Required parity scenario Approval turn tool followthrough failed: openai/gpt-5.4=fail, anthropic/claude-opus-4-6=fail.",
    );
    // Metric comparisons are relative, so a same-on-both-sides failure
    // must not appear as a relative metric failure. The required-scenario
    // failure line is the only thing keeping the gate honest here.
    expect(comparison.failures.some((failure) => failure.includes("completion rate"))).toBe(false);
  });

  it("fails the parity gate when a required parity scenario fails on the candidate only", () => {
    // A candidate regression below a passing baseline is already caught
    // by the relative completion-rate comparison, but surface it as a
    // named required-scenario failure too so operators see a concrete
    // scenario name alongside the rate differential.
    const candidateWithOneFail = withScenarioOverride("Approval turn tool followthrough", {
      status: "fail",
    });
    const comparison = buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: { scenarios: candidateWithOneFail },
      baselineSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
      comparedAt: "2026-04-11T00:00:00.000Z",
    });

    expect(comparison.pass).toBe(false);
    expect(comparison.failures).toContain(
      "Required parity scenario Approval turn tool followthrough failed: openai/gpt-5.4=fail, anthropic/claude-opus-4-6=pass.",
    );
  });

  it("fails the parity gate when the baseline contains suspicious pass results", () => {
-    // Cover the full first-wave pack on both sides so the suspicious-pass assertion
+    // Cover the full second-wave pack on both sides so the suspicious-pass assertion
    // below is the isolated gate failure under test (no coverage-gap noise).
    const comparison = buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: {
-        scenarios: [
-          { name: "Approval turn tool followthrough", status: "pass" },
-          { name: "Compaction retry after mutating tool", status: "pass" },
-          { name: "Model switch with tool continuity", status: "pass" },
-          { name: "Source and docs discovery report", status: "pass" },
-          { name: "Image understanding from attachment", status: "pass" },
-        ],
+        scenarios: FULL_PARITY_PASS_SCENARIOS,
      },
      baselineSummary: {
-        scenarios: [
-          {
-            name: "Approval turn tool followthrough",
-            status: "pass",
-            details: "timed out before it continued",
-          },
-          { name: "Compaction retry after mutating tool", status: "pass" },
-          { name: "Model switch with tool continuity", status: "pass" },
-          { name: "Source and docs discovery report", status: "pass" },
-          { name: "Image understanding from attachment", status: "pass" },
-        ],
+        scenarios: withScenarioOverride("Approval turn tool followthrough", {
+          details: "timed out before it continued",
+        }),
      },
      comparedAt: "2026-04-11T00:00:00.000Z",
    });
@@ -303,36 +383,333 @@ Follow-up:
|
||||
expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(1);
|
||||
});
|
||||
|
||||
it("renders a readable markdown parity report", () => {
|
||||
it("does not flag positive-tone prose as fake success (positive-tone detection removed)", () => {
  // Positive-tone detection was removed because for passing runs the
  // `details` field is the model's prose, which never contains tool-call
  // evidence. Criterion 2 is enforced by per-scenario tool-call assertions.
  const summary: QaParitySuiteSummary = {
    scenarios: [
      {
        name: "Subagent handoff",
        status: "pass",
        details: "Successfully completed the delegation. The subagent returned its result.",
      },
    ],
  };

  expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
});

it("does not flag bare 'Done.' prose as fake success", () => {
  const summary: QaParitySuiteSummary = {
    scenarios: [
      {
        name: "Approval turn tool followthrough",
        status: "pass",
        details: "Done.",
      },
    ],
  };

  expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
});

it("does not flag structured status lines that end in `done`", () => {
  const summary: QaParitySuiteSummary = {
    scenarios: [
      {
        name: "Compaction retry after mutating tool",
        status: "pass",
        details: `Confirmed, replay unsafe after write.
compactionCount=0
status=done`,
      },
    ],
  };

  expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
});

it("does not flag positive-tone passes when the scenario shows real tool-call evidence", () => {
  // A legitimate tool-mediated pass that happens to include
  // "successfully" in its prose must not be flagged. The
  // `plannedToolName` evidence (or any of the other tool-call
  // evidence patterns) exempts the scenario from positive-tone
  // detection. Without this exemption, real tool-backed passes with
  // self-congratulatory prose would count as fake successes and break
  // the gate.
  const summary: QaParitySuiteSummary = {
    scenarios: [
      {
        name: "Source and docs discovery report",
        status: "pass",
        details:
          "Successfully completed the report. plannedToolName=read recorded via /debug/requests.",
      },
    ],
  };

  expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
});

it("only flags failure-tone passes, not positive-tone", () => {
  const summary: QaParitySuiteSummary = {
    scenarios: [
      {
        name: "Approval turn tool followthrough",
        status: "pass",
        details: "Task executed successfully without errors.",
      },
      {
        name: "Subagent handoff",
        status: "pass",
        details: "Tool call completed, but an error occurred mid-turn.",
      },
    ],
  };

  // Only the failure-tone scenario ("error occurred") counts.
  // The positive-tone one ("successfully") is not flagged.
  expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(1);
});

it("throws QaParityLabelMismatchError when the candidate run.primaryProvider does not match the label", () => {
  // Regression for the gate footgun: if an operator swaps the
  // --candidate-summary and --baseline-summary paths, the gate would
  // silently produce a reversed verdict. PR L #64789 ships the `run`
  // block on every summary so the parity report can verify it against
  // the caller-supplied label; this test pins the precondition check.
  const parityPassScenarios = [
    { name: "Approval turn tool followthrough", status: "pass" as const },
    { name: "Compaction retry after mutating tool", status: "pass" as const },
    { name: "Model switch with tool continuity", status: "pass" as const },
    { name: "Source and docs discovery report", status: "pass" as const },
    { name: "Image understanding from attachment", status: "pass" as const },
  ];

  expect(() =>
    buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: {
        scenarios: parityPassScenarios,
        run: { primaryProvider: "anthropic", primaryModel: "claude-opus-4-6" },
      },
      baselineSummary: {
        scenarios: parityPassScenarios,
        run: { primaryProvider: "anthropic", primaryModel: "claude-opus-4-6" },
      },
      comparedAt: "2026-04-11T00:00:00.000Z",
    }),
  ).toThrow(QaParityLabelMismatchError);
});

it("throws QaParityLabelMismatchError when the baseline run.primaryProvider does not match the label", () => {
  const parityPassScenarios = [
    { name: "Approval turn tool followthrough", status: "pass" as const },
  ];

  expect(() =>
    buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: {
        scenarios: parityPassScenarios,
        run: { primaryProvider: "openai" },
      },
      baselineSummary: {
        scenarios: parityPassScenarios,
        run: { primaryProvider: "openai", primaryModel: "gpt-5.4" },
      },
      comparedAt: "2026-04-11T00:00:00.000Z",
    }),
  ).toThrow(
    /baseline summary run\.primaryProvider=openai and run\.primaryModel=gpt-5\.4 do not match --baseline-label/,
  );
});

it("accepts matching run.primaryProvider labels without throwing", () => {
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "openai/gpt-5.4",
    baselineLabel: "anthropic/claude-opus-4-6",
    candidateSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "openai",
        primaryModel: "openai/gpt-5.4",
        primaryModelName: "gpt-5.4",
      },
    },
    baselineSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "anthropic",
        primaryModel: "anthropic/claude-opus-4-6",
        primaryModelName: "claude-opus-4-6",
      },
    },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });
  expect(comparison.pass).toBe(true);
});

it("skips run.primaryProvider verification when the summary is missing a run block (legacy summaries)", () => {
  // Pre-PR-L summaries don't carry a `run` block. The gate must still
  // work against those, trusting the caller-supplied label.
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "openai/gpt-5.4",
    baselineLabel: "anthropic/claude-opus-4-6",
    candidateSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
    baselineSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });
  expect(comparison.pass).toBe(true);
});

it("skips provider verification for arbitrary display labels when run metadata is present", () => {
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "GPT-5.4 candidate",
    baselineLabel: "Opus 4.6 baseline",
    candidateSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "openai",
        primaryModel: "openai/gpt-5.4",
        primaryModelName: "gpt-5.4",
      },
    },
    baselineSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "anthropic",
        primaryModel: "anthropic/claude-opus-4-6",
        primaryModelName: "claude-opus-4-6",
      },
    },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });

  expect(comparison.pass).toBe(true);
});

it("skips provider verification for mixed-case or decorated display labels", () => {
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "Candidate: GPT-5.4",
    baselineLabel: "Opus 4.6 / baseline",
    candidateSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "openai",
        primaryModel: "openai/gpt-5.4",
        primaryModelName: "gpt-5.4",
      },
    },
    baselineSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "anthropic",
        primaryModel: "anthropic/claude-opus-4-6",
        primaryModelName: "claude-opus-4-6",
      },
    },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });

  expect(comparison.pass).toBe(true);
});

it("throws when a structured label mismatches the recorded model even if the provider matches", () => {
  expect(() =>
    buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: {
        scenarios: FULL_PARITY_PASS_SCENARIOS,
        run: {
          primaryProvider: "openai",
          primaryModel: "openai/gpt-5.4-alt",
          primaryModelName: "gpt-5.4-alt",
        },
      },
      baselineSummary: {
        scenarios: FULL_PARITY_PASS_SCENARIOS,
        run: {
          primaryProvider: "anthropic",
          primaryModel: "anthropic/claude-opus-4-6",
          primaryModelName: "claude-opus-4-6",
        },
      },
      comparedAt: "2026-04-11T00:00:00.000Z",
    }),
  ).toThrow(
    /candidate summary run\.primaryProvider=openai and run\.primaryModel=openai\/gpt-5\.4-alt do not match --candidate-label=openai\/gpt-5\.4/,
  );
});

it("accepts colon-delimited structured labels when provider and model both match", () => {
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "openai:gpt-5.4",
    baselineLabel: "anthropic:claude-opus-4-6",
    candidateSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "openai",
        primaryModel: "openai/gpt-5.4",
        primaryModelName: "gpt-5.4",
      },
    },
    baselineSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "anthropic",
        primaryModel: "anthropic/claude-opus-4-6",
        primaryModelName: "claude-opus-4-6",
      },
    },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });

  expect(comparison.pass).toBe(true);
});

it("renders a readable markdown parity report", () => {
  // Cover the full parity pack on both sides so the pass verdict is not
  // disrupted by required-scenario coverage failures added by the
  // second-wave expansion.
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "openai/gpt-5.4",
    baselineLabel: "anthropic/claude-opus-4-6",
    candidateSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
    baselineSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });

  const report = renderQaAgenticParityMarkdownReport(comparison);

  expect(report).toContain(
    "# OpenClaw Agentic Parity Report — openai/gpt-5.4 vs anthropic/claude-opus-4-6",
  );
  expect(report).toContain("| Completion rate | 100.0% | 100.0% |");
  expect(report).toContain("### Approval turn tool followthrough");
  expect(report).toContain("- Verdict: pass");
});

it("parametrizes the markdown header from the comparison labels", () => {
  // Regression for the loop-7 Copilot finding: callers that configure
  // non-gpt-5.4 / non-opus labels (for example an internal candidate vs
  // another candidate) must see the labels in the rendered H1 instead of
  // the hardcoded "GPT-5.4 / Opus 4.6" title that would otherwise confuse
  // readers of saved reports.
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "openai/gpt-5.4-alt",
    baselineLabel: "openai/gpt-5.4",
    candidateSummary: { scenarios: [] },
    baselineSummary: { scenarios: [] },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });
  const report = renderQaAgenticParityMarkdownReport(comparison);
  expect(report).toContain(
    "# OpenClaw Agentic Parity Report — openai/gpt-5.4-alt vs openai/gpt-5.4",
  );
});
});

@@ -1,4 +1,7 @@
import {
  QA_AGENTIC_PARITY_SCENARIO_TITLES,
  QA_AGENTIC_PARITY_TOOL_BACKED_SCENARIO_TITLES,
} from "./agentic-parity.js";

export type QaParityReportStep = {
  name: string;
@@ -13,6 +16,21 @@ export type QaParityReportScenario = {
  steps?: QaParityReportStep[];
};

/**
 * Optional self-describing run metadata written by PR L (#64789). Before
 * that PR merges, older summaries only have `scenarios` + `counts`; the
 * parity report treats a missing `run` block as "unknown provenance" and
 * skips the label-match verification for backwards compatibility with
 * legacy summaries that predate the run metadata block.
 */
export type QaParityRunBlock = {
  primaryProvider?: string;
  primaryModel?: string;
  primaryModelName?: string;
  providerMode?: string;
  scenarioIds?: readonly string[] | null;
};

export type QaParitySuiteSummary = {
  scenarios: QaParityReportScenario[];
  counts?: {
@@ -20,6 +38,8 @@ export type QaParitySuiteSummary = {
    passed?: number;
    failed?: number;
  };
  /** Self-describing run metadata — see PR L #64789 for the writer side. */
  run?: QaParityRunBlock;
};

export type QaAgenticParityMetrics = {
@@ -64,7 +84,11 @@ const UNINTENDED_STOP_PATTERNS = [
  /did not continue/i,
] as const;

// Failure-tone patterns: a passing scenario whose details text matches any
// of these is treated as a "fake success" — the scenario is marked pass but
// the supporting text reveals something went wrong. Adding new patterns here
// widens the net for bad prose that correlates with runtime failure modes.
const SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS = [
  /incomplete turn/i,
  /\btimed out\b/i,
  /\btimeout\b/i,
@@ -76,6 +100,13 @@ const SUSPICIOUS_PASS_PATTERNS = [
  /an error was/i,
] as const;

// Positive-tone patterns (e.g. "Successfully completed", "Done.") are NOT
// checked in fakeSuccessCount. For passing runs, `details` is the model's
// outbound prose, which never contains tool-call evidence strings, so a
// tool-call-evidence exemption would false-positive on every legitimate
// pass. Criterion 2 ("no fake progress") is enforced by per-scenario
// `/debug/requests` tool-call assertions in the YAML flows (PR J) instead.

function normalizeScenarioStatus(status: string | undefined): "pass" | "fail" | "skip" {
  return status === "pass" || status === "fail" || status === "skip" ? status : "fail";
}
@@ -103,6 +134,9 @@ export function computeQaAgenticParityMetrics(
    ...scenario,
    status: normalizeScenarioStatus(scenario.status),
  }));
  const toolBackedTitleSet: ReadonlySet<string> = new Set(
    QA_AGENTIC_PARITY_TOOL_BACKED_SCENARIO_TITLES,
  );
  const totalScenarios = summary.counts?.total ?? scenarios.length;
  const passedScenarios =
    summary.counts?.passed ?? scenarios.filter((scenario) => scenario.status === "pass").length;
@@ -112,16 +146,40 @@ export function computeQaAgenticParityMetrics(
    (scenario) =>
      scenario.status !== "pass" && scenarioHasPattern(scenario, UNINTENDED_STOP_PATTERNS),
  ).length;
  const fakeSuccessCount = scenarios.filter((scenario) => {
    if (scenario.status !== "pass") {
      return false;
    }
    // Failure-tone patterns catch obviously-broken passes regardless of
    // whether the scenario shows tool-call evidence — "timed out" under a
    // pass is always fake.
    if (scenarioHasPattern(scenario, SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS)) {
      return true;
    }
    // Positive-tone patterns (like "Successfully completed") are NOT checked
    // here because for passing runs the `details` field is the model's
    // outbound prose, which never contains tool-call evidence strings.
    // The `scenarioLacksToolCallEvidence` check would return true for ALL
    // passes and false-positive on legitimate completions. Criterion 2
    // ("no fake tool completion") is instead enforced by the per-scenario
    // `/debug/requests` tool-call assertions from the scenario YAML flows.
    return false;
  }).length;

  // Count only the scenarios that are supposed to exercise a real tool,
  // subagent, or capability invocation. Memory recall and image-only
  // understanding lanes stay in the parity pack, but they should not inflate
  // the tool-call metric just by passing.
  const toolBackedScenarioCount = scenarios.filter((scenario) =>
    toolBackedTitleSet.has(scenario.name),
  ).length;
  const validToolCallCount = scenarios.filter(
    (scenario) => toolBackedTitleSet.has(scenario.name) && scenario.status === "pass",
  ).length;

  const rate = (value: number) => (totalScenarios > 0 ? value / totalScenarios : 0);
  const toolRate = (value: number) =>
    toolBackedScenarioCount > 0 ? value / toolBackedScenarioCount : 0;
  return {
    totalScenarios,
    passedScenarios,
@@ -130,7 +188,7 @@ export function computeQaAgenticParityMetrics(
    unintendedStopCount,
    unintendedStopRate: rate(unintendedStopCount),
    validToolCallCount,
    validToolCallRate: toolRate(validToolCallCount),
    fakeSuccessCount,
  };
}
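The failure-tone rule above can be sketched standalone. This is an illustrative reduction, not the module's exports: the pattern list is a two-entry subset, and `isFakeSuccess` stands in for the `status === "pass"` guard plus `scenarioHasPattern` over the failure-tone list.

```typescript
// Only a PASS whose details carry failure-sounding prose is a fake
// success; positive tone alone never flags, and non-passes are counted
// elsewhere (as unintended stops), never here.
const FAILURE_TONE = [/\btimed out\b/i, /an error occurred/i];

function isFakeSuccess(status: string, details: string): boolean {
  if (status !== "pass") return false;
  return FAILURE_TONE.some((pattern) => pattern.test(details));
}

console.log(isFakeSuccess("pass", "Tool call completed, but an error occurred mid-turn.")); // true
console.log(isFakeSuccess("pass", "Successfully completed the delegation.")); // false
console.log(isFakeSuccess("fail", "timed out before it continued")); // false
```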
@@ -149,14 +207,116 @@ function scopeSummaryToParityPack(
  summary: QaParitySuiteSummary,
  parityTitleSet: ReadonlySet<string>,
): QaParitySuiteSummary {
  // The parity verdict must only consider the declared parity scenarios
  // (the full first-wave + second-wave pack from QA_AGENTIC_PARITY_SCENARIOS).
  // Drop `counts` so the metric helper recomputes totals from the filtered
  // scenario list instead of inheriting the caller's full-suite counters.
  return {
    scenarios: summary.scenarios.filter((scenario) => parityTitleSet.has(scenario.name)),
    ...(summary.run ? { run: summary.run } : {}),
  };
}
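A standalone sketch of the scoping rule above, with illustrative scenario names: filter to the parity title set and let totals be rebuilt from the filtered list rather than trusting the caller's full-suite counts.

```typescript
// Non-parity rows are dropped; the stale `counts` block is simply not
// carried over, so downstream metrics recompute from what's left.
const parityTitleSet = new Set(["Subagent handoff", "Memory recall after context switch"]);
const fullSuite = [
  { name: "Subagent handoff", status: "pass" },
  { name: "Unrelated smoke test", status: "fail" },
];

const scoped = fullSuite.filter((scenario) => parityTitleSet.has(scenario.name));
console.log(scoped.map((scenario) => scenario.name)); // ["Subagent handoff"]
```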

type StructuredQaParityLabel = {
  provider: string;
  model: string;
};

/**
 * Only treat caller labels as provenance-checked identifiers when they are
 * exact lower-case provider/model refs. Human-facing display labels like
 * "GPT-5.4 candidate" or "Candidate: GPT-5.4" should render in the report
 * without being misread as structured provider ids.
 */
function parseStructuredLabelRef(label: string): StructuredQaParityLabel | null {
  const trimmed = label.trim();
  if (trimmed.length === 0) {
    return null;
  }
  if (trimmed !== trimmed.toLowerCase()) {
    return null;
  }
  const separatorMatch = /^([a-z0-9][a-z0-9-]*)[/:]([a-z0-9][a-z0-9._-]*)$/.exec(trimmed);
  if (!separatorMatch) {
    return null;
  }
  return {
    provider: separatorMatch[1] ?? "",
    model: separatorMatch[2] ?? "",
  };
}
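The parsing contract above, shown standalone with the same regex (the `parseLabel` wrapper name is for this sketch only): exact lower-case `provider/model` or `provider:model` refs parse, and display labels fall through to `null`.

```typescript
const STRUCTURED_LABEL = /^([a-z0-9][a-z0-9-]*)[/:]([a-z0-9][a-z0-9._-]*)$/;

function parseLabel(label: string): { provider: string; model: string } | null {
  const trimmed = label.trim();
  // Any uppercase character marks a human-facing display label.
  if (trimmed.length === 0 || trimmed !== trimmed.toLowerCase()) return null;
  const match = STRUCTURED_LABEL.exec(trimmed);
  return match ? { provider: match[1] ?? "", model: match[2] ?? "" } : null;
}

console.log(parseLabel("openai/gpt-5.4")); // { provider: "openai", model: "gpt-5.4" }
console.log(parseLabel("anthropic:claude-opus-4-6")); // { provider: "anthropic", model: "claude-opus-4-6" }
console.log(parseLabel("Candidate: GPT-5.4")); // null (display label)
```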

/**
 * Verify the `run.primaryProvider` + `run.primaryModel` fields on a summary
 * match the caller-supplied label when that label is a structured
 * `provider/model` or `provider:model` ref. PR L #64789 ships the `run`
 * block; before it lands, older summaries don't have the field and this check
 * is a no-op.
 *
 * Throws `QaParityLabelMismatchError` when the summary reports a different
 * provider/model than the caller claimed — this catches the "swapped
 * candidate and baseline summary paths" footgun the earlier adversarial
 * review flagged. Returns silently when the fields are absent (legacy
 * summaries) or when the fields match.
 */
function verifySummaryLabelMatch(params: {
  summary: QaParitySuiteSummary;
  label: string;
  role: "candidate" | "baseline";
}): void {
  const runProvider = params.summary.run?.primaryProvider?.trim();
  const runModel = params.summary.run?.primaryModel?.trim();
  const runModelName = params.summary.run?.primaryModelName?.trim();
  if (!runProvider || !runModel) {
    return;
  }
  const labelRef = parseStructuredLabelRef(params.label);
  if (!labelRef) {
    return;
  }
  const normalizedRunModel = runModel.toLowerCase();
  const normalizedRunModelName = runModelName?.toLowerCase();
  const normalizedLabelModel = labelRef.model;
  if (
    runProvider.toLowerCase() === labelRef.provider &&
    (normalizedRunModel === normalizedLabelModel ||
      normalizedRunModelName === normalizedLabelModel ||
      normalizedRunModel === `${labelRef.provider}/${normalizedLabelModel}`)
  ) {
    return;
  }
  throw new QaParityLabelMismatchError({
    role: params.role,
    label: params.label,
    runProvider,
    runModel,
  });
}
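The model comparison above accepts three spellings of `run.primaryModel` for a structured label. A standalone sketch for the label "openai/gpt-5.4" (the sample run values below, including the "gpt-5.4-2026" spelling, are assumptions for illustration):

```typescript
// Accepted forms: the bare model id, the provider-qualified id, or a
// matching primaryModelName alongside a differently-spelled primaryModel.
const provider = "openai";
const labelModel = "gpt-5.4";

function labelMatchesRun(runModel: string, runModelName?: string): boolean {
  const model = runModel.toLowerCase();
  return (
    model === labelModel ||
    runModelName?.toLowerCase() === labelModel ||
    model === `${provider}/${labelModel}`
  );
}

console.log(labelMatchesRun("gpt-5.4")); // true
console.log(labelMatchesRun("openai/gpt-5.4")); // true
console.log(labelMatchesRun("gpt-5.4-2026", "gpt-5.4")); // true (name matches)
console.log(labelMatchesRun("openai/gpt-5.4-alt", "gpt-5.4-alt")); // false
```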

export class QaParityLabelMismatchError extends Error {
  readonly role: "candidate" | "baseline";
  readonly label: string;
  readonly runProvider: string;
  readonly runModel: string;

  constructor(params: {
    role: "candidate" | "baseline";
    label: string;
    runProvider: string;
    runModel: string;
  }) {
    super(
      `${params.role} summary run.primaryProvider=${params.runProvider} and run.primaryModel=${params.runModel} do not match --${params.role}-label=${params.label}. ` +
        `Check that the --candidate-summary / --baseline-summary paths weren't swapped.`,
    );
    this.name = "QaParityLabelMismatchError";
    this.role = params.role;
    this.label = params.label;
    this.runProvider = params.runProvider;
    this.runModel = params.runModel;
  }
}

export function buildQaAgenticParityComparison(params: {
  candidateLabel: string;
  baselineLabel: string;
@@ -164,6 +324,22 @@ export function buildQaAgenticParityComparison(params: {
  baselineSummary: QaParitySuiteSummary;
  comparedAt?: string;
}): QaAgenticParityComparison {
  // Precondition: verify the `run.primaryProvider` field on each summary
  // matches the caller-supplied label (when the `run` block is present).
  // Throws `QaParityLabelMismatchError` on mismatch so the release gate
  // fails loudly instead of silently producing a reversed verdict when an
  // operator swaps the --candidate-summary and --baseline-summary paths.
  // Legacy summaries without a `run` block are accepted as-is.
  verifySummaryLabelMatch({
    summary: params.candidateSummary,
    label: params.candidateLabel,
    role: "candidate",
  });
  verifySummaryLabelMatch({
    summary: params.baselineSummary,
    label: params.baselineLabel,
    role: "baseline",
  });
  const parityTitleSet: ReadonlySet<string> = new Set<string>(QA_AGENTIC_PARITY_SCENARIO_TITLES);
  // Rates and fake-success counts are computed from the parity-scoped summaries only,
  // so extra non-parity scenarios in the input (for example when a caller feeds a full
@@ -203,7 +379,7 @@ export function buildQaAgenticParityComparison(params: {
  });

  const failures: string[] = [];
  const requiredScenarioStatuses = QA_AGENTIC_PARITY_SCENARIO_TITLES.map((name) => {
    const candidate = candidateByName.get(name);
    const baseline = baselineByName.get(name);
    return {
@@ -211,7 +387,8 @@ export function buildQaAgenticParityComparison(params: {
      candidateStatus: requiredCoverageStatus(candidate),
      baselineStatus: requiredCoverageStatus(baseline),
    };
  });
  const requiredScenarioCoverage = requiredScenarioStatuses.filter(
    (scenario) =>
      scenario.candidateStatus === "missing" ||
      scenario.baselineStatus === "missing" ||
@@ -223,6 +400,26 @@ export function buildQaAgenticParityComparison(params: {
      `Missing required parity scenario coverage for ${scenario.name}: ${params.candidateLabel}=${scenario.candidateStatus}, ${params.baselineLabel}=${scenario.baselineStatus}.`,
    );
  }
  // Required parity scenarios that ran on both sides but FAILED also fail
  // the gate. Without this check, a run where both models fail the same
  // required scenarios still produced pass=true, because the downstream
  // metric comparisons are purely relative (candidate vs baseline) and
  // the suspicious-pass fake-success check only catches passes that carry
  // failure-sounding details. Excluding missing/skip here keeps operator
  // output from double-counting the same scenario with two lines.
  const requiredScenarioFailures = requiredScenarioStatuses.filter(
    (scenario) =>
      scenario.candidateStatus !== "missing" &&
      scenario.baselineStatus !== "missing" &&
      scenario.candidateStatus !== "skip" &&
      scenario.baselineStatus !== "skip" &&
      (scenario.candidateStatus === "fail" || scenario.baselineStatus === "fail"),
  );
  for (const scenario of requiredScenarioFailures) {
    failures.push(
      `Required parity scenario ${scenario.name} failed: ${params.candidateLabel}=${scenario.candidateStatus}, ${params.baselineLabel}=${scenario.baselineStatus}.`,
    );
  }
  // Required parity scenarios are already reported via `requiredScenarioCoverage`
  // above; excluding them here keeps the operator-facing failure list from
  // double-counting the same missing scenario (one "Missing required parity scenario
@@ -281,8 +478,13 @@ export function buildQaAgenticParityComparison(params: {
}

export function renderQaAgenticParityMarkdownReport(comparison: QaAgenticParityComparison): string {
  // Title is parametrized from the candidate / baseline labels so reports
  // for any candidate/baseline pair (not only gpt-5.4 vs opus 4.6) render
  // with an accurate header. The default CLI labels are still
  // openai/gpt-5.4 vs anthropic/claude-opus-4-6, but the helper works for
  // any parity comparison a caller configures.
  const lines = [
    `# OpenClaw Agentic Parity Report — ${comparison.candidateLabel} vs ${comparison.baselineLabel}`,
    "",
    `- Compared at: ${comparison.comparedAt}`,
    `- Candidate: ${comparison.candidateLabel}`,

@@ -4,22 +4,57 @@ export const QA_AGENTIC_PARITY_SCENARIOS = [
  {
    id: "approval-turn-tool-followthrough",
    title: "Approval turn tool followthrough",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "model-switch-tool-continuity",
    title: "Model switch with tool continuity",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "source-docs-discovery-report",
    title: "Source and docs discovery report",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "image-understanding-attachment",
    title: "Image understanding from attachment",
    countsTowardValidToolCallRate: false,
  },
  {
    id: "compaction-retry-mutating-tool",
    title: "Compaction retry after mutating tool",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "subagent-handoff",
    title: "Subagent handoff",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "subagent-fanout-synthesis",
    title: "Subagent fanout synthesis",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "memory-recall",
    title: "Memory recall after context switch",
    countsTowardValidToolCallRate: false,
  },
  {
    id: "thread-memory-isolation",
    title: "Thread memory isolation",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "config-restart-capability-flip",
    title: "Config restart capability flip",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "instruction-followthrough-repo-contract",
    title: "Instruction followthrough repo contract",
    countsTowardValidToolCallRate: true,
  },
] as const;

@@ -27,6 +62,9 @@ export const QA_AGENTIC_PARITY_SCENARIO_IDS = QA_AGENTIC_PARITY_SCENARIOS.map(({
export const QA_AGENTIC_PARITY_SCENARIO_TITLES = QA_AGENTIC_PARITY_SCENARIOS.map(
  ({ title }) => title,
);
export const QA_AGENTIC_PARITY_TOOL_BACKED_SCENARIO_TITLES = QA_AGENTIC_PARITY_SCENARIOS.filter(
  ({ countsTowardValidToolCallRate }) => countsTowardValidToolCallRate,
).map(({ title }) => title);
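The derived-list pattern above, sketched with a two-row table of the same shape as QA_AGENTIC_PARITY_SCENARIOS (rows abbreviated for the example):

```typescript
// A scenario opts into the tool-call metric via its
// countsTowardValidToolCallRate flag; the title list is derived, never
// hand-maintained, so the two cannot drift apart.
const SCENARIOS = [
  { id: "subagent-handoff", title: "Subagent handoff", countsTowardValidToolCallRate: true },
  { id: "memory-recall", title: "Memory recall after context switch", countsTowardValidToolCallRate: false },
] as const;

const toolBackedTitles = SCENARIOS.filter(
  ({ countsTowardValidToolCallRate }) => countsTowardValidToolCallRate,
).map(({ title }) => title);

console.log(toolBackedTitles); // ["Subagent handoff"]
```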

export function resolveQaParityPackScenarioIds(params: {
  parityPack?: string;

@@ -338,6 +338,12 @@ describe("qa cli runtime", () => {
        "source-docs-discovery-report",
        "image-understanding-attachment",
        "compaction-retry-mutating-tool",
        "subagent-handoff",
        "subagent-fanout-synthesis",
        "memory-recall",
        "thread-memory-isolation",
        "config-restart-capability-flip",
        "instruction-followthrough-repo-contract",
      ],
    }),
  );
@@ -566,6 +572,39 @@ describe("qa cli runtime", () => {
    );
  });

it("passes provider-qualified mock parity suite selection through to the host runner", async () => {
  await runQaSuiteCommand({
    repoRoot: "/tmp/openclaw-repo",
    providerMode: "mock-openai",
    parityPack: "agentic",
    primaryModel: "openai/gpt-5.4",
    alternateModel: "anthropic/claude-opus-4-6",
  });

  expect(runQaSuiteFromRuntime).toHaveBeenCalledWith({
    repoRoot: path.resolve("/tmp/openclaw-repo"),
    outputDir: undefined,
    transportId: "qa-channel",
    providerMode: "mock-openai",
    primaryModel: "openai/gpt-5.4",
    alternateModel: "anthropic/claude-opus-4-6",
    fastMode: undefined,
    scenarioIds: [
      "approval-turn-tool-followthrough",
      "model-switch-tool-continuity",
      "source-docs-discovery-report",
      "image-understanding-attachment",
      "compaction-retry-mutating-tool",
      "subagent-handoff",
      "subagent-fanout-synthesis",
      "memory-recall",
      "thread-memory-isolation",
      "config-restart-capability-flip",
      "instruction-followthrough-repo-contract",
    ],
  });
});

it("rejects multipass-only suite flags on the host runner", async () => {
  await expect(
    runQaSuiteCommand({

@@ -64,6 +64,11 @@ describe("buildQaRuntimeEnv", () => {
    expect(env.GEMINI_API_KEY).toBe("gemini-live");
  });

  it("defaults gateway-child provider mode to mock-openai when omitted", () => {
    expect(__testing.resolveQaGatewayChildProviderMode(undefined)).toBe("mock-openai");
    expect(__testing.resolveQaGatewayChildProviderMode("live-frontier")).toBe("live-frontier");
  });

  it("keeps explicit provider env vars over live aliases", () => {
    const env = buildQaRuntimeEnv({
      ...createParams({
@@ -299,6 +304,88 @@ describe("buildQaRuntimeEnv", () => {
    });
  });

  it("stages placeholder mock auth profiles per agent dir so mock-openai runs can resolve credentials", async () => {
    const stateDir = await mkdtemp(path.join(os.tmpdir(), "qa-mock-auth-"));
    cleanups.push(async () => {
      await rm(stateDir, { recursive: true, force: true });
    });

    const cfg = await __testing.stageQaMockAuthProfiles({
      cfg: {},
      stateDir,
    });

    // Config side: both providers should have a profile entry with mode
    // "api_key" so the runtime picks up the staging without any further
    // config mutation.
    expect(cfg.auth?.profiles?.["qa-mock-openai"]).toMatchObject({
      provider: "openai",
      mode: "api_key",
      displayName: "QA mock openai credential",
    });
    expect(cfg.auth?.profiles?.["qa-mock-anthropic"]).toMatchObject({
      provider: "anthropic",
      mode: "api_key",
      displayName: "QA mock anthropic credential",
    });

    // Store side: each agent dir should have its own auth-profiles.json
    // containing the placeholder credential for each staged provider. This
    // is what the scenario runner actually reads when it resolves auth
    // before calling the mock.
    for (const agentId of ["main", "qa"]) {
      const storeRaw = await readFile(
        path.join(stateDir, "agents", agentId, "agent", "auth-profiles.json"),
        "utf8",
      );
      const parsed = JSON.parse(storeRaw) as {
        profiles: Record<string, { type: string; provider: string; key: string }>;
      };
      expect(parsed.profiles["qa-mock-openai"]).toMatchObject({
        type: "api_key",
        provider: "openai",
        key: "qa-mock-not-a-real-key",
      });
      expect(parsed.profiles["qa-mock-anthropic"]).toMatchObject({
        type: "api_key",
        provider: "anthropic",
        key: "qa-mock-not-a-real-key",
      });
    }
  });

  it("stages mock profiles only for the requested agents and providers when callers override the defaults", async () => {
    const stateDir = await mkdtemp(path.join(os.tmpdir(), "qa-mock-auth-override-"));
    cleanups.push(async () => {
      await rm(stateDir, { recursive: true, force: true });
    });

    const cfg = await __testing.stageQaMockAuthProfiles({
      cfg: {},
      stateDir,
      agentIds: ["qa"],
      providers: ["openai"],
    });

    expect(cfg.auth?.profiles?.["qa-mock-openai"]).toMatchObject({
      provider: "openai",
      mode: "api_key",
    });
    // Anthropic should NOT be staged when the caller restricts providers.
    expect(cfg.auth?.profiles?.["qa-mock-anthropic"]).toBeUndefined();

    const qaStore = JSON.parse(
      await readFile(path.join(stateDir, "agents", "qa", "agent", "auth-profiles.json"), "utf8"),
    ) as { profiles: Record<string, unknown> };
    expect(qaStore.profiles["qa-mock-openai"]).toBeDefined();
    expect(qaStore.profiles["qa-mock-anthropic"]).toBeUndefined();

    // main/agent should not exist because it wasn't in the agentIds list.
    await expect(
      readFile(path.join(stateDir, "agents", "main", "agent", "auth-profiles.json"), "utf8"),
    ).rejects.toThrow(/ENOENT/);
  });

  it("allows loopback gateway health probes through the SSRF guard", async () => {
    const release = vi.fn(async () => {});
    fetchWithSsrFGuardMock.mockResolvedValue({

@@ -222,6 +222,12 @@ export function normalizeQaProviderModeEnv(
  return env;
}

export function resolveQaGatewayChildProviderMode(
  providerMode?: "mock-openai" | "live-frontier",
): "mock-openai" | "live-frontier" {
  return providerMode ?? "mock-openai";
}

function resolveQaLiveCliAuthEnv(
  baseEnv: NodeJS.ProcessEnv,
  opts?: {
@@ -395,6 +401,72 @@ export async function stageQaLiveAnthropicSetupToken(params: {
  });
}

/** Providers the mock-openai harness stages placeholder credentials for. */
export const QA_MOCK_AUTH_PROVIDERS = Object.freeze(["openai", "anthropic"] as const);

/** Agent IDs the mock-openai harness stages credentials under. */
export const QA_MOCK_AUTH_AGENT_IDS = Object.freeze(["main", "qa"] as const);

export function buildQaMockProfileId(provider: string): string {
  return `qa-mock-${provider}`;
}

/**
 * In mock-openai mode the qa suite runs against the embedded mock server
 * instead of a real provider API. The mock does not validate credentials, but
 * the agent auth layer still needs a matching `api_key` auth profile in
 * `auth-profiles.json` before it will route the request through
 * `providerBaseUrl`. Without this staging step, every scenario fails with
 * `FailoverError: No API key found for provider "openai"` before the mock
 * server ever sees a request.
 *
 * Stages a placeholder `api_key` profile per provider in each of the agent
 * dirs the qa suite uses (`main` for the runtime config, `qa` for scenario
 * runs) and returns a config with matching `auth.profiles` entries so the
 * runtime accepts the profile on the first lookup.
 *
 * The placeholder value `qa-mock-not-a-real-key` is intentionally not
 * shaped like a real API key (no `sk-` prefix that would trip secret
 * scanners). It only needs to be non-empty to pass the credential
 * serializer; anything beyond that is ignored by the mock.
 */
export async function stageQaMockAuthProfiles(params: {
  cfg: OpenClawConfig;
  stateDir: string;
  agentIds?: readonly string[];
  providers?: readonly string[];
}): Promise<OpenClawConfig> {
  const agentIds = [...new Set(params.agentIds ?? QA_MOCK_AUTH_AGENT_IDS)];
  const providers = [...new Set(params.providers ?? QA_MOCK_AUTH_PROVIDERS)];
  let next = params.cfg;
  for (const agentId of agentIds) {
    const agentDir = path.join(params.stateDir, "agents", agentId, "agent");
    await fs.mkdir(agentDir, { recursive: true });
    for (const provider of providers) {
      const profileId = buildQaMockProfileId(provider);
      upsertAuthProfile({
        profileId,
        credential: {
          type: "api_key",
          provider,
          key: "qa-mock-not-a-real-key",
          displayName: `QA mock ${provider} credential`,
        },
        agentDir,
      });
    }
  }
  for (const provider of providers) {
    next = applyAuthProfileConfig(next, {
      profileId: buildQaMockProfileId(provider),
      provider,
      mode: "api_key",
      displayName: `QA mock ${provider} credential`,
    });
  }
  return next;
}
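For reference, a minimal sketch of what the staging step leaves on disk. The `stageMockProfile` helper below is a hypothetical stand-in for the real `upsertAuthProfile`; the profile ids, `type`, and placeholder key mirror the values asserted in the tests above.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Illustrative stand-in (NOT the real helper): writes one placeholder
// credential into <agentDir>/auth-profiles.json, merging with any
// profiles already staged there.
function stageMockProfile(agentDir: string, provider: string): void {
  fs.mkdirSync(agentDir, { recursive: true });
  const storePath = path.join(agentDir, "auth-profiles.json");
  const store = fs.existsSync(storePath)
    ? (JSON.parse(fs.readFileSync(storePath, "utf8")) as { profiles: Record<string, unknown> })
    : { profiles: {} as Record<string, unknown> };
  store.profiles[`qa-mock-${provider}`] = {
    type: "api_key",
    provider,
    key: "qa-mock-not-a-real-key",
  };
  fs.writeFileSync(storePath, JSON.stringify(store, null, 2));
}

// Default staging: both providers under both agent dirs.
const stateDir = fs.mkdtempSync(path.join(os.tmpdir(), "qa-mock-auth-demo-"));
for (const agentId of ["main", "qa"]) {
  for (const provider of ["openai", "anthropic"]) {
    stageMockProfile(path.join(stateDir, "agents", agentId, "agent"), provider);
  }
}

const qaStore = JSON.parse(
  fs.readFileSync(path.join(stateDir, "agents", "qa", "agent", "auth-profiles.json"), "utf8"),
) as { profiles: Record<string, { provider: string }> };
console.log(Object.keys(qaStore.profiles)); // ["qa-mock-openai", "qa-mock-anthropic"]
```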

function isRetryableGatewayCallError(details: string): boolean {
  return (
    details.includes("handshake timeout") ||
@@ -440,8 +512,10 @@ export const __testing = {
  preserveQaGatewayDebugArtifacts,
  redactQaGatewayDebugText,
  readQaLiveProviderConfigOverrides,
  resolveQaGatewayChildProviderMode,
  resolveQaLiveAnthropicSetupToken,
  stageQaLiveAnthropicSetupToken,
  stageQaMockAuthProfiles,
  resolveQaLiveCliAuthEnv,
  resolveQaOwnerPluginIdsForProviderIds,
  resolveQaBundledPluginsSourceRoot,
@@ -868,8 +942,9 @@ export async function startQaGatewayChild(params: {
    fs.mkdir(xdgDataHome, { recursive: true }),
    fs.mkdir(xdgCacheHome, { recursive: true }),
  ]);
  const providerMode = resolveQaGatewayChildProviderMode(params.providerMode);
  const liveProviderIds =
-   params.providerMode === "live-frontier"
+   providerMode === "live-frontier"
      ? [params.primaryModel, params.alternateModel]
          .map((modelRef) =>
            typeof modelRef === "string" ? splitQaModelRef(modelRef)?.provider : undefined,
@@ -902,7 +977,7 @@ export async function startQaGatewayChild(params: {
      controlUiEnabled: params.controlUiEnabled,
    }),
    controlUiAllowedOrigins: params.controlUiAllowedOrigins,
-   providerMode: params.providerMode,
+   providerMode,
    primaryModel: params.primaryModel,
    alternateModel: params.alternateModel,
    enabledPluginIds,
@@ -921,6 +996,12 @@ export async function startQaGatewayChild(params: {
      cfg,
      stateDir,
    });
    if (providerMode === "mock-openai") {
      cfg = await stageQaMockAuthProfiles({
        cfg,
        stateDir,
      });
    }
    return params.mutateConfig ? params.mutateConfig(cfg) : cfg;
  };
  const stdout: Buffer[] = [];
@@ -981,7 +1062,7 @@ export async function startQaGatewayChild(params: {
    xdgCacheHome,
    bundledPluginsDir,
    compatibilityHostVersion: runtimeHostVersion,
-   providerMode: params.providerMode,
+   providerMode,
    forwardHostHomeForClaudeCli: liveProviderIds.includes("claude-cli"),
    claudeCliAuthMode: params.claudeCliAuthMode,
  });

File diff suppressed because it is too large
@@ -22,6 +22,58 @@ type StreamEvent =
    };
  };

/**
 * Provider variant tag for `body.model`. The mock previously ignored
 * `body.model` for dispatch and only echoed it in the prose output, which
 * made the parity gate tautological when run against the mock alone
 * (both providers produced identical scenario plans by construction).
 * Tagging requests with a normalized variant lets individual scenario
 * branches opt into provider-specific behavior while the rest of the
 * dispatcher stays shared, and lets `/debug/requests` consumers verify
 * which provider lane a given request came from without re-parsing the
 * raw model string.
 *
 * Policy:
 * - `openai/*`, `openai-codex/*`, or bare `gpt-*` / `o1-*` / `openai-*` names → `"openai"`
 * - `anthropic/*`, `claude-cli/*`, or bare `claude-*` / `anthropic-*` names → `"anthropic"`
 * - Everything else (including empty strings) → `"unknown"`
 *
 * The `/v1/messages` route always feeds `body.model` straight through,
 * so an Anthropic request with an `openai/gpt-5.4` model string is still
 * classified as `"openai"`. That matches the parity program's convention
 * where the provider label is the source of truth, not the HTTP route.
 */
export type MockOpenAiProviderVariant = "openai" | "anthropic" | "unknown";

export function resolveProviderVariant(model: string | undefined): MockOpenAiProviderVariant {
  if (typeof model !== "string") {
    return "unknown";
  }
  const trimmed = model.trim().toLowerCase();
  if (trimmed.length === 0) {
    return "unknown";
  }
  // Prefer the explicit `provider/model` or `provider:model` prefix when
  // the caller supplied one — that's the most reliable signal.
  const separatorMatch = /^([^/:]+)[/:]/.exec(trimmed);
  const provider = separatorMatch?.[1] ?? trimmed;
  if (provider === "openai" || provider === "openai-codex") {
    return "openai";
  }
  if (provider === "anthropic" || provider === "claude-cli") {
    return "anthropic";
  }
  // Fall back to model-name prefix matching for bare model strings like
  // `gpt-5.4` or `claude-opus-4-6`.
  if (/^(?:gpt-|o1-|openai-)/.test(trimmed)) {
    return "openai";
  }
  if (/^(?:claude-|anthropic-)/.test(trimmed)) {
    return "anthropic";
  }
  return "unknown";
}
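The policy can be spot-checked in isolation. The block below reproduces `resolveProviderVariant` from this diff verbatim and exercises each classification branch:

```typescript
// Reproduced from the diff above so the policy bullets can be checked directly.
type MockOpenAiProviderVariant = "openai" | "anthropic" | "unknown";

function resolveProviderVariant(model: string | undefined): MockOpenAiProviderVariant {
  if (typeof model !== "string") {
    return "unknown";
  }
  const trimmed = model.trim().toLowerCase();
  if (trimmed.length === 0) {
    return "unknown";
  }
  // Explicit `provider/model` or `provider:model` prefix wins.
  const separatorMatch = /^([^/:]+)[/:]/.exec(trimmed);
  const provider = separatorMatch?.[1] ?? trimmed;
  if (provider === "openai" || provider === "openai-codex") {
    return "openai";
  }
  if (provider === "anthropic" || provider === "claude-cli") {
    return "anthropic";
  }
  // Bare model-name prefixes for strings like `gpt-5.4` or `claude-opus-4-6`.
  if (/^(?:gpt-|o1-|openai-)/.test(trimmed)) {
    return "openai";
  }
  if (/^(?:claude-|anthropic-)/.test(trimmed)) {
    return "anthropic";
  }
  return "unknown";
}

console.log(resolveProviderVariant("openai/gpt-5.4")); // "openai"
console.log(resolveProviderVariant("claude-opus-4-6")); // "anthropic"
console.log(resolveProviderVariant("mistral-large")); // "unknown"
```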

type MockOpenAiRequestSnapshot = {
  raw: string;
  body: Record<string, unknown>;
@@ -30,13 +82,52 @@ type MockOpenAiRequestSnapshot = {
  instructions?: string;
  toolOutput: string;
  model: string;
  providerVariant: MockOpenAiProviderVariant;
  imageInputCount: number;
  plannedToolName?: string;
};

// Anthropic /v1/messages request/response shapes the mock actually needs.
// This is a subset of the real Anthropic Messages API — just enough so the
// QA suite can run its parity pack against a "baseline" Anthropic provider
// without needing real API keys. The scenarios drive their dispatch through
// the shared mock scenario logic (buildResponsesPayload), so whatever
// behavior the OpenAI mock exposes is automatically mirrored on this route.
type AnthropicMessageContentBlock =
  | { type: "text"; text: string }
  | {
      type: "tool_use";
      id: string;
      name: string;
      input: Record<string, unknown>;
    }
  | {
      type: "tool_result";
      tool_use_id: string;
      content: string | Array<{ type: "text"; text: string }>;
    }
  | { type: "image"; source: Record<string, unknown> };

type AnthropicMessage = {
  role: "user" | "assistant";
  content: string | AnthropicMessageContentBlock[];
};

type AnthropicMessagesRequest = {
  model?: string;
  max_tokens?: number;
  system?: string | Array<{ type: "text"; text: string }>;
  messages?: AnthropicMessage[];
  tools?: Array<Record<string, unknown>>;
  stream?: boolean;
};

const TINY_PNG_BASE64 =
  "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8/x8AAwMCAO7Z0nQAAAAASUVORK5CYII=";
let subagentFanoutPhase = 0;

type MockScenarioState = {
  subagentFanoutPhase: number;
};

function readBody(req: IncomingMessage): Promise<string> {
  return new Promise((resolve, reject) => {
@@ -68,6 +159,23 @@ function writeSse(res: ServerResponse, events: StreamEvent[]) {
  res.end(body);
}

type AnthropicStreamEvent = Record<string, unknown> & {
  type: string;
};

function writeAnthropicSse(res: ServerResponse, events: AnthropicStreamEvent[]) {
  const body = events
    .map((event) => `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`)
    .join("");
  res.writeHead(200, {
    "content-type": "text/event-stream",
    "cache-control": "no-store",
    connection: "keep-alive",
    "content-length": Buffer.byteLength(body),
  });
  res.end(body);
}
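Each Anthropic stream event is framed with an explicit `event:` line naming its type, unlike the OpenAI writer. A minimal sketch of the serialization, using the same map expression as `writeAnthropicSse` without the `ServerResponse` plumbing:

```typescript
type AnthropicStreamEvent = Record<string, unknown> & { type: string };

// Same serialization as writeAnthropicSse above, minus the HTTP headers
// and response object: one "event:" line per event, JSON payload on the
// "data:" line, blank line as the frame terminator.
function serializeAnthropicSse(events: AnthropicStreamEvent[]): string {
  return events
    .map((event) => `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`)
    .join("");
}

const body = serializeAnthropicSse([
  { type: "message_start", message: { id: "msg_demo" } },
  { type: "message_stop" },
]);
console.log(body.startsWith("event: message_start\n")); // true
```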

function countApproxTokens(text: string) {
  const trimmed = text.trim();
  if (!trimmed) {
@@ -376,11 +484,11 @@ function extractLastCapture(text: string, pattern: RegExp)
}

function extractExactReplyDirective(text: string) {
- const colonMatch = extractLastCapture(text, /reply(?: with)? exactly:\s*([^\n]+)/i);
- if (colonMatch) {
-   return colonMatch;
+ const backtickedMatch = extractLastCapture(text, /reply(?: with)? exactly\s+`([^`]+)`/i);
+ if (backtickedMatch) {
+   return backtickedMatch;
  }
- return extractLastCapture(text, /reply(?: with)? exactly\s+`([^`]+)`/i);
+ return extractLastCapture(text, /reply(?: with)? exactly:\s*([^\n]+)/i);
}

function extractExactMarkerDirective(text: string) {
@@ -392,10 +500,18 @@ function extractExactMarkerDirective(text: string) {
}

function isHeartbeatPrompt(text: string) {
- return /Read HEARTBEAT\.md if it exists/i.test(text);
+ const trimmed = text.trim();
+ if (!trimmed || /remember this fact/i.test(trimmed)) {
+   return false;
+ }
+ return /(?:^|\n)Read HEARTBEAT\.md if it exists\b/i.test(trimmed);
}
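The tightened guard can be exercised standalone; this reproduces the new `isHeartbeatPrompt` body from the hunk above:

```typescript
// New isHeartbeatPrompt body, reproduced from the diff: anchored match on
// the trimmed input, with an explicit opt-out for memory-recall prompts.
function isHeartbeatPrompt(text: string): boolean {
  const trimmed = text.trim();
  if (!trimmed || /remember this fact/i.test(trimmed)) {
    return false;
  }
  return /(?:^|\n)Read HEARTBEAT\.md if it exists\b/i.test(trimmed);
}

console.log(isHeartbeatPrompt("Read HEARTBEAT.md if it exists, then reply HEARTBEAT_OK.")); // true
console.log(isHeartbeatPrompt("Remember this fact: Read HEARTBEAT.md if it exists.")); // false
console.log(isHeartbeatPrompt("Please also Read HEARTBEAT.md if it exists.")); // false
```

The old one-line version matched the heartbeat phrase anywhere in the text, so a recall prompt that merely quoted it would have been answered with `HEARTBEAT_OK`.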

- function buildAssistantText(input: ResponsesInputItem[], body: Record<string, unknown>) {
+ function buildAssistantText(
+   input: ResponsesInputItem[],
+   body: Record<string, unknown>,
+   scenarioState: MockScenarioState,
+ ) {
  const prompt = extractLastUserText(input);
  const toolOutput = extractToolOutput(input);
  const toolJson = parseToolOutputJson(toolOutput);
@@ -411,8 +527,10 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
      : toolOutput;
  const orbitCode = extractOrbitCode(memorySnippet);
  const mediaPath = /MEDIA:([^\n]+)/.exec(toolOutput)?.[1]?.trim();
- const exactReplyDirective = extractExactReplyDirective(allInputText);
- const exactMarkerDirective = extractExactMarkerDirective(allInputText);
+ const exactReplyDirective =
+   extractExactReplyDirective(prompt) ?? extractExactReplyDirective(allInputText);
+ const exactMarkerDirective =
+   extractExactMarkerDirective(prompt) ?? extractExactMarkerDirective(allInputText);
  const imageInputCount = countImageInputs(input);
  const activeMemorySummary = extractActiveMemorySummary(allInputText);
  const snackPreference = extractSnackPreference(activeMemorySummary ?? memorySnippet);
@@ -456,6 +574,23 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
  if (/tool continuity check/i.test(prompt) && toolOutput) {
    return `Protocol note: model switch handoff confirmed on ${model || "the requested model"}. QA mission from QA_KICKOFF_TASK.md still applies: understand this OpenClaw repo from source + docs before acting.`;
  }
  if (toolOutput && /repo contract followthrough check/i.test(prompt)) {
    if (
      /successfully (?:wrote|created|updated|replaced)/i.test(toolOutput) ||
      /status:\s*complete/i.test(toolOutput)
    ) {
      return [
        "Read: AGENT.md, SOUL.md, FOLLOWTHROUGH_INPUT.md",
        "Wrote: repo-contract-summary.txt",
        "Status: complete",
      ].join("\n");
    }
    return [
      "Read: AGENT.md, SOUL.md, FOLLOWTHROUGH_INPUT.md",
      "Wrote: repo-contract-summary.txt",
      "Status: blocked",
    ].join("\n");
  }
  if (/session memory ranking check/i.test(prompt) && orbitCode) {
    return `Protocol note: I checked memory and the current Project Nebula codename is ${orbitCode}.`;
  }
@@ -489,7 +624,11 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
  if (/fanout worker beta/i.test(prompt)) {
    return "BETA-OK";
  }
- if (/subagent fanout synthesis check/i.test(prompt) && toolOutput && subagentFanoutPhase >= 2) {
+ if (
+   /subagent fanout synthesis check/i.test(prompt) &&
+   toolOutput &&
+   scenarioState.subagentFanoutPhase >= 2
+ ) {
    return "Protocol note: delegated fanout complete. Alpha=ALPHA-OK. Beta=BETA-OK.";
  }
  if (toolOutput && (/\bdelegate\b/i.test(prompt) || /subagent handoff/i.test(prompt))) {
@@ -579,7 +718,10 @@ function buildAssistantEvents(text: string): StreamEvent[] {
  ];
}

- async function buildResponsesPayload(body: Record<string, unknown>) {
+ async function buildResponsesPayload(
+   body: Record<string, unknown>,
+   scenarioState: MockScenarioState,
+ ) {
  const input = Array.isArray(body.input) ? (body.input as ResponsesInputItem[]) : [];
  const prompt = extractLastUserText(input);
  const toolOutput = extractToolOutput(input);
@@ -587,6 +729,9 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
  const allInputText = extractAllRequestTexts(input, body);
  const isGroupChat = allInputText.includes('"is_group_chat": true');
  const isBaselineUnmentionedChannelChatter = /\bno bot ping here\b/i.test(prompt);
  if (/remember this fact/i.test(prompt)) {
    return buildAssistantEvents(buildAssistantText(input, body, scenarioState));
  }
  if (isHeartbeatPrompt(prompt)) {
    return buildAssistantEvents("HEARTBEAT_OK");
  }
@@ -756,16 +901,16 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
    });
  }
  if (/subagent fanout synthesis check/i.test(prompt)) {
-   if (!toolOutput && subagentFanoutPhase === 0) {
-     subagentFanoutPhase = 1;
+   if (!toolOutput && scenarioState.subagentFanoutPhase === 0) {
+     scenarioState.subagentFanoutPhase = 1;
      return buildToolCallEventsWithArgs("sessions_spawn", {
        task: "Fanout worker alpha: inspect the QA workspace and finish with exactly ALPHA-OK.",
        label: "qa-fanout-alpha",
        thread: false,
      });
    }
-   if (toolOutput && subagentFanoutPhase === 1) {
-     subagentFanoutPhase = 2;
+   if (toolOutput && scenarioState.subagentFanoutPhase === 1) {
+     scenarioState.subagentFanoutPhase = 2;
      return buildToolCallEventsWithArgs("sessions_spawn", {
        task: "Fanout worker beta: inspect the QA workspace and finish with exactly BETA-OK.",
        label: "qa-fanout-beta",
@@ -776,6 +921,30 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
  if (/tool continuity check/i.test(prompt) && !toolOutput) {
    return buildToolCallEventsWithArgs("read", { path: "QA_KICKOFF_TASK.md" });
  }
  if (/repo contract followthrough check/i.test(prompt)) {
    if (!toolOutput) {
      return buildToolCallEventsWithArgs("read", { path: "AGENT.md" });
    }
    if (toolOutput.includes("# Repo contract")) {
      return buildToolCallEventsWithArgs("read", { path: "SOUL.md" });
    }
    if (toolOutput.includes("# Execution style")) {
      return buildToolCallEventsWithArgs("read", { path: "FOLLOWTHROUGH_INPUT.md" });
    }
    if (
      toolOutput.includes("Mission: prove you followed the repo contract.") &&
      toolOutput.includes("Evidence path: AGENT.md -> SOUL.md -> FOLLOWTHROUGH_INPUT.md")
    ) {
      return buildToolCallEventsWithArgs("write", {
        path: "repo-contract-summary.txt",
        content: [
          "Mission: prove you followed the repo contract.",
          "Evidence: AGENT.md -> SOUL.md -> FOLLOWTHROUGH_INPUT.md",
          "Status: complete",
        ].join("\n"),
      });
    }
  }
  if ((/\bdelegate\b/i.test(prompt) || /subagent handoff/i.test(prompt)) && !toolOutput) {
    return buildToolCallEventsWithArgs("sessions_spawn", {
      task: "Inspect the QA workspace and return one concise protocol note.",
@@ -807,12 +976,390 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
  ) {
    await sleep(60_000);
  }
- return buildAssistantEvents(buildAssistantText(input, body));
+ return buildAssistantEvents(buildAssistantText(input, body, scenarioState));
}

// ---------------------------------------------------------------------------
// Anthropic /v1/messages adapter
// ---------------------------------------------------------------------------
//
// The QA parity gate needs two comparable scenario runs: one against the
// "candidate" (openai/gpt-5.4) and one against the "baseline"
// (anthropic/claude-opus-4-6). The OpenAI mock above already dispatches all
// the scenario prompt branches we care about. Rather than duplicating that
// machinery, the /v1/messages route below translates Anthropic request
// shapes into the shared ResponsesInputItem[] format, calls the same
// buildResponsesPayload() dispatcher, and then re-serializes the resulting
// events into an Anthropic response. This gives the parity harness a
// baseline lane that exercises the same scenario logic without requiring
// real Anthropic API keys.
//
// Scope: handles Anthropic Messages requests with text and tool_result
// content blocks, supporting both non-streaming JSON responses and the
// streaming SSE path used by the parity harness.

function normalizeAnthropicSystemToString(
  system: AnthropicMessagesRequest["system"],
): string | undefined {
  if (typeof system === "string") {
    return system.trim() || undefined;
  }
  if (Array.isArray(system)) {
    const joined = system
      .map((block) => (block?.type === "text" ? block.text : ""))
      .filter(Boolean)
      .join("\n")
      .trim();
    return joined || undefined;
  }
  return undefined;
}

function stringifyToolResultContent(
  content: Extract<AnthropicMessageContentBlock, { type: "tool_result" }>["content"],
): string {
  if (typeof content === "string") {
    return content;
  }
  if (Array.isArray(content)) {
    return content
      .map((block) => (block?.type === "text" ? block.text : ""))
      .filter(Boolean)
      .join("\n");
  }
  return "";
}

function convertAnthropicMessagesToResponsesInput(params: {
  system?: AnthropicMessagesRequest["system"];
  messages: AnthropicMessage[];
}): ResponsesInputItem[] {
  const items: ResponsesInputItem[] = [];
  const systemText = normalizeAnthropicSystemToString(params.system);
  if (systemText) {
    items.push({
      role: "system",
      content: [{ type: "input_text", text: systemText }],
    });
  }
  for (const message of params.messages) {
    const content = message.content;
    if (typeof content === "string") {
      items.push({
        role: message.role,
        content: [
          message.role === "assistant"
            ? { type: "output_text", text: content }
            : { type: "input_text", text: content },
        ],
      });
      continue;
    }
    if (!Array.isArray(content)) {
      continue;
    }
    // Buffer each block type so we can push in OpenAI-Responses order instead
    // of the order they appear in the Anthropic content array. The parent
    // role message must precede any function_call_output items from the same
    // turn, otherwise extractToolOutput() (which scans for
    // function_call_output AFTER the last user-role index) will not see the
    // output and the downstream scenario dispatcher will behave as if no
    // tool output was returned. Similarly, assistant tool_use blocks become
    // function_call items that must follow the assistant text message they
    // narrate.
    const textPieces: Array<{ type: "input_text" | "output_text"; text: string }> = [];
    const imagePieces: Array<{ type: "input_image"; image_url: string }> = [];
    const toolResultItems: ResponsesInputItem[] = [];
    const toolUseItems: ResponsesInputItem[] = [];
    for (const block of content) {
      if (!block || typeof block !== "object") {
        continue;
      }
      if (block.type === "text") {
        textPieces.push({
          type: message.role === "assistant" ? "output_text" : "input_text",
          text: block.text ?? "",
        });
        continue;
      }
      if (block.type === "image") {
        // Mock only needs to count image inputs; a placeholder URL is fine.
        imagePieces.push({ type: "input_image", image_url: "anthropic-mock:image" });
        continue;
      }
      if (block.type === "tool_result") {
        const output = stringifyToolResultContent(block.content);
        if (output.trim()) {
          toolResultItems.push({ type: "function_call_output", output });
        }
        continue;
      }
      if (block.type === "tool_use") {
        // Mirror OpenAI's function_call output_item shape so downstream
        // prompt extraction still sees "the assistant just emitted a tool
        // call". The scenario dispatcher looks for tool_output on the next
        // user turn, not the assistant's prior tool_use, so a minimal
        // placeholder is enough.
        toolUseItems.push({
          type: "function_call",
          name: block.name,
          arguments: JSON.stringify(block.input ?? {}),
          call_id: block.id,
        });
        continue;
      }
    }
    if (textPieces.length > 0 || imagePieces.length > 0) {
      const combinedContent: Array<Record<string, unknown>> = [...textPieces, ...imagePieces];
      items.push({ role: message.role, content: combinedContent });
    }
    // Emit tool_use (assistant prior calls) and tool_result (user-side
    // returns) AFTER the parent role message so extractLastUserText and
    // extractToolOutput walk the array in the order they expect. For a
    // tool_result-only user turn with no text/image blocks, the parent
    // message is intentionally omitted — the function_call_output itself
    // represents the user's "return the tool output" turn.
    for (const toolUse of toolUseItems) {
      items.push(toolUse);
    }
    for (const toolResult of toolResultItems) {
      items.push(toolResult);
    }
  }
  return items;
}
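The ordering contract described in the comments (parent role message first, then tool items) can be sketched with a pared-down converter. The types below are simplified stand-ins for the real `ResponsesInputItem` union, not the actual shapes:

```typescript
// Simplified stand-ins; only the fields the ordering demo needs.
type DemoBlock =
  | { type: "text"; text: string }
  | { type: "tool_result"; output: string };
type DemoItem =
  | { role: "user"; content: string }
  | { type: "function_call_output"; output: string };

// Mirrors the buffering strategy above: collect text and tool_result
// blocks separately, then emit the role message before any
// function_call_output items, regardless of block order in the input.
function convertUserTurn(blocks: DemoBlock[]): DemoItem[] {
  const texts: string[] = [];
  const toolResults: DemoItem[] = [];
  for (const block of blocks) {
    if (block.type === "text") {
      texts.push(block.text);
    } else {
      toolResults.push({ type: "function_call_output", output: block.output });
    }
  }
  const items: DemoItem[] = [];
  if (texts.length > 0) {
    items.push({ role: "user", content: texts.join("\n") });
  }
  return [...items, ...toolResults];
}

// tool_result arrives FIRST in the Anthropic content array, but the
// converted items still put the user message ahead of it.
const items = convertUserTurn([
  { type: "tool_result", output: "ALPHA-OK" },
  { type: "text", text: "subagent fanout synthesis check" },
]);
console.log(items.map((item) => ("role" in item ? "user" : item.type)));
```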
|
||||
|
||||
type ExtractedAssistantOutput = {
  text: string;
  toolCalls: Array<{ id: string; name: string; input: Record<string, unknown> }>;
};

function extractFinalAssistantOutputFromEvents(events: StreamEvent[]): ExtractedAssistantOutput {
  const toolCalls: ExtractedAssistantOutput["toolCalls"] = [];
  let text = "";
  for (const event of events) {
    if (event.type !== "response.output_item.done") {
      continue;
    }
    const item = event.item as {
      type?: unknown;
      name?: unknown;
      call_id?: unknown;
      id?: unknown;
      arguments?: unknown;
      content?: unknown;
    };
    if (item.type === "function_call" && typeof item.name === "string") {
      let input: Record<string, unknown> = {};
      if (typeof item.arguments === "string" && item.arguments.trim()) {
        try {
          const parsed = JSON.parse(item.arguments) as unknown;
          if (parsed && typeof parsed === "object" && !Array.isArray(parsed)) {
            input = parsed as Record<string, unknown>;
          }
        } catch {
          // keep empty input on malformed args — mock dispatcher owns arg shape
        }
      }
      toolCalls.push({
        id: typeof item.call_id === "string" ? item.call_id : `toolu_mock_${toolCalls.length + 1}`,
        name: item.name,
        input,
      });
      continue;
    }
    if (item.type === "message" && Array.isArray(item.content)) {
      for (const piece of item.content as Array<{ type?: unknown; text?: unknown }>) {
        if (piece?.type === "output_text" && typeof piece.text === "string") {
          text = piece.text;
        }
      }
    }
  }
  return { text, toolCalls };
}
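The extraction rules above (only `response.output_item.done` events count, malformed `arguments` JSON degrades to an empty input, and the last `output_text` block wins) can be exercised in isolation. This is a minimal standalone sketch, not the mock server's own code: `SketchEvent` narrows the event shape to the two fields the logic reads.

```typescript
// Standalone sketch of the extraction rule: function_call items become
// tool calls, the LAST output_text block becomes the final text.
type SketchEvent = { type: string; item?: Record<string, unknown> };

function sketchExtract(events: SketchEvent[]) {
  const toolCalls: Array<{ name: string; input: Record<string, unknown> }> = [];
  let text = "";
  for (const event of events) {
    if (event.type !== "response.output_item.done" || !event.item) continue;
    const item = event.item;
    if (item.type === "function_call" && typeof item.name === "string") {
      let input: Record<string, unknown> = {};
      try {
        const parsed = JSON.parse(String(item.arguments ?? "")) as unknown;
        if (parsed && typeof parsed === "object" && !Array.isArray(parsed)) {
          input = parsed as Record<string, unknown>;
        }
      } catch {
        // malformed args → keep empty input, matching the mock server
      }
      toolCalls.push({ name: item.name, input });
    } else if (item.type === "message" && Array.isArray(item.content)) {
      for (const piece of item.content as Array<{ type?: string; text?: string }>) {
        if (piece?.type === "output_text" && typeof piece.text === "string") text = piece.text;
      }
    }
  }
  return { text, toolCalls };
}

const out = sketchExtract([
  { type: "response.output_item.done", item: { type: "function_call", name: "read", arguments: '{"path":"AGENT.md"}' } },
  { type: "response.output_item.done", item: { type: "message", content: [{ type: "output_text", text: "done" }] } },
]);
console.log(out.text, out.toolCalls[0]?.name); // done read
```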

function buildAnthropicMessageResponse(params: {
  model: string;
  extracted: ExtractedAssistantOutput;
}): Record<string, unknown> {
  const content: Array<Record<string, unknown>> = [];
  if (params.extracted.text) {
    content.push({ type: "text", text: params.extracted.text });
  }
  for (const call of params.extracted.toolCalls) {
    content.push({
      type: "tool_use",
      id: call.id,
      name: call.name,
      input: call.input,
    });
  }
  if (content.length === 0) {
    content.push({ type: "text", text: "" });
  }
  const stopReason = params.extracted.toolCalls.length > 0 ? "tool_use" : "end_turn";
  const approxInputTokens = 64;
  const approxOutputTokens = Math.max(
    16,
    countApproxTokens(params.extracted.text) + params.extracted.toolCalls.length * 16,
  );
  return {
    id: `msg_mock_${Math.floor(Math.random() * 1_000_000).toString(16)}`,
    type: "message",
    role: "assistant",
    model: params.model || "claude-opus-4-6",
    content,
    stop_reason: stopReason,
    stop_sequence: null,
    usage: {
      input_tokens: approxInputTokens,
      output_tokens: approxOutputTokens,
    },
  };
}

function buildAnthropicMessageStreamEvents(params: {
  model: string;
  extracted: ExtractedAssistantOutput;
}): AnthropicStreamEvent[] {
  const approxInputTokens = 64;
  const approxOutputTokens = Math.max(
    16,
    countApproxTokens(params.extracted.text) + params.extracted.toolCalls.length * 16,
  );
  const messageId = `msg_mock_${Math.floor(Math.random() * 1_000_000).toString(16)}`;
  const events: AnthropicStreamEvent[] = [
    {
      type: "message_start",
      message: {
        id: messageId,
        type: "message",
        role: "assistant",
        model: params.model || "claude-opus-4-6",
        content: [],
        stop_reason: null,
        stop_sequence: null,
        usage: {
          input_tokens: approxInputTokens,
          output_tokens: 0,
        },
      },
    },
  ];
  let index = 0;
  if (params.extracted.text || params.extracted.toolCalls.length === 0) {
    events.push({
      type: "content_block_start",
      index,
      content_block: {
        type: "text",
        text: "",
      },
    });
    if (params.extracted.text) {
      events.push({
        type: "content_block_delta",
        index,
        delta: {
          type: "text_delta",
          text: params.extracted.text,
        },
      });
    }
    events.push({
      type: "content_block_stop",
      index,
    });
    index += 1;
  }
  for (const call of params.extracted.toolCalls) {
    events.push({
      type: "content_block_start",
      index,
      content_block: {
        type: "tool_use",
        id: call.id,
        name: call.name,
        input: {},
      },
    });
    events.push({
      type: "content_block_delta",
      index,
      delta: {
        type: "input_json_delta",
        partial_json: JSON.stringify(call.input ?? {}),
      },
    });
    events.push({
      type: "content_block_stop",
      index,
    });
    index += 1;
  }
  events.push({
    type: "message_delta",
    delta: {
      stop_reason: params.extracted.toolCalls.length > 0 ? "tool_use" : "end_turn",
    },
    usage: {
      input_tokens: approxInputTokens,
      output_tokens: approxOutputTokens,
    },
  });
  events.push({
    type: "message_stop",
  });
  return events;
}
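The builder above always emits a fixed envelope: `message_start`, then one `content_block_start`/`content_block_delta`/`content_block_stop` group per block with a strictly increasing `index`, then `message_delta`, then `message_stop`. A minimal order checker (a sketch over a narrowed `{ type, index }` event shape, not the server's own validation) makes that invariant explicit:

```typescript
// Checks the Anthropic stream envelope: message_start first,
// balanced content_block groups with increasing index, then
// message_delta and message_stop at the end.
type Ev = { type: string; index?: number };

function isWellOrderedAnthropicStream(events: Ev[]): boolean {
  if (events[0]?.type !== "message_start") return false;
  if (events.at(-1)?.type !== "message_stop") return false;
  if (events.at(-2)?.type !== "message_delta") return false;
  let open: number | null = null; // index of the currently open block
  let nextIndex = 0;
  for (const ev of events.slice(1, -2)) {
    if (ev.type === "content_block_start") {
      if (open !== null || ev.index !== nextIndex) return false;
      open = ev.index;
    } else if (ev.type === "content_block_delta") {
      if (ev.index !== open) return false;
    } else if (ev.type === "content_block_stop") {
      if (ev.index !== open) return false;
      open = null;
      nextIndex += 1;
    } else {
      return false;
    }
  }
  return open === null;
}

const ok = isWellOrderedAnthropicStream([
  { type: "message_start" },
  { type: "content_block_start", index: 0 },
  { type: "content_block_delta", index: 0 },
  { type: "content_block_stop", index: 0 },
  { type: "message_delta" },
  { type: "message_stop" },
]);
console.log(ok); // true
```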

async function buildMessagesPayload(
  body: AnthropicMessagesRequest,
  scenarioState: MockScenarioState,
): Promise<{
  events: StreamEvent[];
  input: ResponsesInputItem[];
  extracted: ExtractedAssistantOutput;
  responseBody: Record<string, unknown>;
  streamEvents: AnthropicStreamEvent[];
  model: string;
}> {
  const messages = Array.isArray(body.messages) ? body.messages : [];
  const input = convertAnthropicMessagesToResponsesInput({
    system: body.system,
    messages,
  });
  // Treat empty-string model the same as absent. A bare typeof check lets
  // `""` leak through to `responseBody.model` and `lastRequest.model`,
  // which then confuses parity consumers that assume the mock always
  // echoes the real provider label. Normalize once and reuse everywhere.
  const normalizedModel =
    typeof body.model === "string" && body.model.trim() !== "" ? body.model : "claude-opus-4-6";
  // Dispatch through the same scenario logic the /v1/responses route uses.
  // The mock dispatcher only reads `body.input`, `body.model`, and
  // `body.stream`, so a synthetic shim body is sufficient.
  const dispatchBody: Record<string, unknown> = {
    input,
    model: normalizedModel,
    stream: false,
  };
  const events = await buildResponsesPayload(dispatchBody, scenarioState);
  const extracted = extractFinalAssistantOutputFromEvents(events);
  const responseBody = buildAnthropicMessageResponse({
    model: normalizedModel,
    extracted,
  });
  const streamEvents = buildAnthropicMessageStreamEvents({
    model: normalizedModel,
    extracted,
  });
  return { events, input, extracted, responseBody, streamEvents, model: normalizedModel };
}

export async function startQaMockOpenAiServer(params?: { host?: string; port?: number }) {
  const host = params?.host ?? "127.0.0.1";
  subagentFanoutPhase = 0;
  const scenarioState: MockScenarioState = { subagentFanoutPhase: 0 };
  let lastRequest: MockOpenAiRequestSnapshot | null = null;
  const requests: MockOpenAiRequestSnapshot[] = [];
  const imageGenerationRequests: Array<Record<string, unknown>> = [];
@@ -829,6 +1376,8 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
          { id: "gpt-5.4-alt", object: "model" },
          { id: "gpt-image-1", object: "model" },
          { id: "text-embedding-3-small", object: "model" },
+         { id: "claude-opus-4-6", object: "model" },
+         { id: "claude-sonnet-4-6", object: "model" },
        ],
      });
      return;
@@ -888,7 +1437,8 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
      const raw = await readBody(req);
      const body = raw ? (JSON.parse(raw) as Record<string, unknown>) : {};
      const input = Array.isArray(body.input) ? (body.input as ResponsesInputItem[]) : [];
-     const events = await buildResponsesPayload(body);
+     const events = await buildResponsesPayload(body, scenarioState);
+     const resolvedModel = typeof body.model === "string" ? body.model : "";
      lastRequest = {
        raw,
        body,
@@ -896,7 +1446,8 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
        allInputText: extractAllRequestTexts(input, body),
        instructions: extractInstructionsText(body) || undefined,
        toolOutput: extractToolOutput(input),
-       model: typeof body.model === "string" ? body.model : "",
+       model: resolvedModel,
+       providerVariant: resolveProviderVariant(resolvedModel),
        imageInputCount: countImageInputs(input),
        plannedToolName: extractPlannedToolName(events),
      };
@@ -916,6 +1467,56 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
        writeSse(res, events);
        return;
      }
      if (req.method === "POST" && url.pathname === "/v1/messages") {
        const raw = await readBody(req);
        let body: AnthropicMessagesRequest = {};
        try {
          body = raw ? (JSON.parse(raw) as AnthropicMessagesRequest) : {};
        } catch {
          writeJson(res, 400, {
            type: "error",
            error: {
              type: "invalid_request_error",
              message: "Malformed JSON body for Anthropic Messages request.",
            },
          });
          return;
        }
        const {
          events,
          input,
          responseBody,
          streamEvents,
          model: normalizedModel,
        } = await buildMessagesPayload(body, scenarioState);
        // Record the adapted request snapshot so /debug/requests gives the QA
        // suite the same plannedToolName / allInputText / toolOutput signals
        // on the Anthropic route that the OpenAI route already exposes. This
        // is what lets a single parity run diff assertions across both lanes.
        // Reuse the normalized model so an empty-string body.model no longer
        // leaks through to `lastRequest.model`.
        lastRequest = {
          raw,
          body: body as Record<string, unknown>,
          prompt: extractLastUserText(input),
          allInputText: extractAllInputTexts(input),
          toolOutput: extractToolOutput(input),
          model: normalizedModel,
          providerVariant: resolveProviderVariant(normalizedModel),
          imageInputCount: countImageInputs(input),
          plannedToolName: extractPlannedToolName(events),
        };
        requests.push(lastRequest);
        if (requests.length > 50) {
          requests.splice(0, requests.length - 50);
        }
        if (body.stream === true) {
          writeAnthropicSse(res, streamEvents);
          return;
        }
        writeJson(res, 200, responseBody);
        return;
      }
      writeJson(res, 404, { error: "not found" });
  });


@@ -53,6 +53,11 @@ describe("buildQaGatewayConfig", () => {

    expect(getPrimaryModel(cfg.agents?.defaults?.model)).toBe("mock-openai/gpt-5.4");
    expect(cfg.models?.providers?.["mock-openai"]?.baseUrl).toBe("http://127.0.0.1:44080/v1");
    expect(cfg.models?.providers?.["mock-openai"]?.request).toEqual({ allowPrivateNetwork: true });
    expect(cfg.models?.providers?.openai?.baseUrl).toBe("http://127.0.0.1:44080/v1");
    expect(cfg.models?.providers?.openai?.request).toEqual({ allowPrivateNetwork: true });
    expect(cfg.models?.providers?.anthropic?.baseUrl).toBe("http://127.0.0.1:44080");
    expect(cfg.models?.providers?.anthropic?.request).toEqual({ allowPrivateNetwork: true });
    expect(cfg.plugins?.allow).toEqual(["memory-core", "qa-channel"]);
    expect(cfg.plugins?.entries?.["memory-core"]).toEqual({ enabled: true });
    expect(cfg.plugins?.entries?.["qa-channel"]).toEqual({ enabled: true });
@@ -66,6 +71,31 @@ describe("buildQaGatewayConfig", () => {
    expect(cfg.messages?.groupChat?.mentionPatterns).toEqual(["\\b@?openclaw\\b"]);
  });

  it("maps provider-qualified openai and anthropic refs through the mock provider lane", () => {
    const cfg = buildQaGatewayConfig({
      bind: "loopback",
      gatewayPort: 18789,
      gatewayToken: "token",
      providerBaseUrl: "http://127.0.0.1:44080/v1",
      workspaceDir: "/tmp/qa-workspace",
      providerMode: "mock-openai",
      primaryModel: "openai/gpt-5.4",
      alternateModel: "anthropic/claude-opus-4-6",
    });

    expect(getPrimaryModel(cfg.agents?.defaults?.model)).toBe("openai/gpt-5.4");
    expect(cfg.models?.providers?.openai?.api).toBe("openai-responses");
    expect(cfg.models?.providers?.openai?.request).toEqual({ allowPrivateNetwork: true });
    expect(cfg.models?.providers?.openai?.models.map((model) => model.id)).toContain("gpt-5.4");
    expect(cfg.models?.providers?.anthropic?.api).toBe("anthropic-messages");
    expect(cfg.models?.providers?.anthropic?.baseUrl).toBe("http://127.0.0.1:44080");
    expect(cfg.models?.providers?.anthropic?.request).toEqual({ allowPrivateNetwork: true });
    expect(cfg.models?.providers?.anthropic?.models.map((model) => model.id)).toContain(
      "claude-opus-4-6",
    );
    expect(cfg.plugins?.allow).toEqual(["memory-core"]);
  });

  it("can omit qa-channel for live transport gateway children", () => {
    const cfg = buildQaGatewayConfig({
      bind: "loopback",

@@ -45,6 +45,10 @@ export function normalizeQaThinkingLevel(input: unknown): QaThinkingLevel | unde
  return undefined;
}

function trimTrailingApiV1(baseUrl: string) {
  return baseUrl.replace(/\/v1\/?$/i, "");
}
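The helper's regex in isolation: it strips a single trailing `/v1` or `/v1/` segment, case-insensitively, so one mock base URL can serve both the OpenAI-style `/v1` routes and the Anthropic route that expects no `/v1` suffix. A quick self-contained check of its behavior:

```typescript
// Same regex as trimTrailingApiV1 above, shown standalone.
function trimTrailingApiV1(baseUrl: string): string {
  return baseUrl.replace(/\/v1\/?$/i, "");
}

console.log(trimTrailingApiV1("http://127.0.0.1:44080/v1"));  // http://127.0.0.1:44080
console.log(trimTrailingApiV1("http://127.0.0.1:44080/V1/")); // http://127.0.0.1:44080
console.log(trimTrailingApiV1("http://127.0.0.1:44080"));     // http://127.0.0.1:44080
```

Note the anchor `$`: only a trailing segment is removed, so a path like `/v1/models` would be left intact.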

export function mergeQaControlUiAllowedOrigins(extraOrigins?: string[]) {
  const normalizedExtra = (extraOrigins ?? [])
    .map((origin) => origin.trim())
@@ -74,10 +78,14 @@ export function buildQaGatewayConfig(params: {
  thinkingDefault?: QaThinkingLevel;
}): OpenClawConfig {
  const mockProviderBaseUrl = params.providerBaseUrl ?? "http://127.0.0.1:44080/v1";
  const mockAnthropicBaseUrl = trimTrailingApiV1(mockProviderBaseUrl);
  const mockOpenAiProvider: ModelProviderConfig = {
    baseUrl: mockProviderBaseUrl,
    apiKey: "test",
    api: "openai-responses",
    request: {
      allowPrivateNetwork: true,
    },
    models: [
      {
        id: "gpt-5.4",
@@ -126,6 +134,50 @@ export function buildQaGatewayConfig(params: {
      },
    ],
  };
  const mockNamedOpenAiProvider: ModelProviderConfig = {
    ...mockOpenAiProvider,
    models: mockOpenAiProvider.models.map((model) => ({ ...model })),
  };
  const mockAnthropicProvider: ModelProviderConfig = {
    baseUrl: mockAnthropicBaseUrl,
    apiKey: "test",
    api: "anthropic-messages",
    request: {
      allowPrivateNetwork: true,
    },
    models: [
      {
        id: "claude-opus-4-6",
        name: "claude-opus-4-6",
        api: "anthropic-messages",
        reasoning: false,
        input: ["text", "image"],
        cost: {
          input: 0,
          output: 0,
          cacheRead: 0,
          cacheWrite: 0,
        },
        contextWindow: 200_000,
        maxTokens: 4096,
      },
      {
        id: "claude-sonnet-4-6",
        name: "claude-sonnet-4-6",
        api: "anthropic-messages",
        reasoning: false,
        input: ["text", "image"],
        cost: {
          input: 0,
          output: 0,
          cacheRead: 0,
          cacheWrite: 0,
        },
        contextWindow: 200_000,
        maxTokens: 4096,
      },
    ],
  };
  const providerMode = normalizeQaProviderMode(params.providerMode ?? "mock-openai");
  const primaryModel = params.primaryModel ?? defaultQaModelForMode(providerMode);
  const alternateModel =
@@ -273,6 +325,8 @@ export function buildQaGatewayConfig(params: {
      mode: "replace",
      providers: {
        "mock-openai": mockOpenAiProvider,
+       openai: mockNamedOpenAiProvider,
+       anthropic: mockAnthropicProvider,
      },
    },
  }

@@ -118,6 +118,50 @@ describe("qa scenario catalog", () => {
    );
  });

  it("keeps mock-only image debug assertions guarded in live-frontier runs", () => {
    const scenario = readQaScenarioPack().scenarios.find(
      (candidate) => candidate.id === "image-understanding-attachment",
    );
    const imageRequestAction = scenario?.execution.flow?.steps
      .flatMap((step) => step.actions ?? [])
      .find(
        (
          action,
        ): action is {
          set: string;
          value?: { expr?: string };
        } =>
          typeof action === "object" &&
          action !== null &&
          "set" in action &&
          action.set === "imageRequest",
      );
    const imageRequestExpr = imageRequestAction?.value?.expr;

    expect(imageRequestExpr).toContain("env.mock ?");
    expect(imageRequestExpr).toContain("/debug/requests");
  });

  it("adds a repo-instruction followthrough scenario to the parity pack", () => {
    const scenario = readQaScenarioById("instruction-followthrough-repo-contract");
    const config = readQaScenarioExecutionConfig("instruction-followthrough-repo-contract") as
      | {
          workspaceFiles?: Record<string, string>;
          prompt?: string;
          expectedReplyAll?: string[];
        }
      | undefined;

    expect(config?.workspaceFiles?.["AGENT.md"]).toContain("Step order:");
    expect(config?.workspaceFiles?.["SOUL.md"]).toContain("action-first");
    expect(config?.workspaceFiles?.["FOLLOWTHROUGH_INPUT.md"]).toContain(
      "Mission: prove you followed the repo contract.",
    );
    expect(config?.prompt).toContain("Repo contract followthrough check.");
    expect(config?.expectedReplyAll).toEqual(["read:", "wrote:", "status:"]);
    expect(scenario.title).toBe("Instruction followthrough repo contract");
  });

  it("rejects malformed string matcher lists before running a flow", () => {
    expect(() =>
      validateQaScenarioExecutionConfig({

101
extensions/qa-lab/src/suite.summary-json.test.ts
Normal file
@@ -0,0 +1,101 @@
import { describe, expect, it } from "vitest";
import { buildQaSuiteSummaryJson } from "./suite.js";

describe("buildQaSuiteSummaryJson", () => {
  const baseParams = {
    // Test scenarios include a `steps: []` field to match the real suite
    // scenario-result shape so downstream consumers that rely on the shape
    // (parity gate, report render) stay aligned.
    scenarios: [
      { name: "Scenario A", status: "pass" as const, steps: [] },
      { name: "Scenario B", status: "fail" as const, details: "something broke", steps: [] },
    ],
    startedAt: new Date("2026-04-11T00:00:00.000Z"),
    finishedAt: new Date("2026-04-11T00:05:00.000Z"),
    providerMode: "mock-openai" as const,
    primaryModel: "openai/gpt-5.4",
    alternateModel: "openai/gpt-5.4-alt",
    fastMode: true,
    concurrency: 2,
  };

  it("records provider/model/mode so parity gates can verify labels", () => {
    const json = buildQaSuiteSummaryJson(baseParams);
    expect(json.run).toMatchObject({
      startedAt: "2026-04-11T00:00:00.000Z",
      finishedAt: "2026-04-11T00:05:00.000Z",
      providerMode: "mock-openai",
      primaryModel: "openai/gpt-5.4",
      primaryProvider: "openai",
      primaryModelName: "gpt-5.4",
      alternateModel: "openai/gpt-5.4-alt",
      alternateProvider: "openai",
      alternateModelName: "gpt-5.4-alt",
      fastMode: true,
      concurrency: 2,
      scenarioIds: null,
    });
  });

  it("includes scenarioIds in run metadata when provided", () => {
    const scenarioIds = ["approval-turn-tool-followthrough", "subagent-handoff", "memory-recall"];
    const json = buildQaSuiteSummaryJson({
      ...baseParams,
      scenarioIds,
    });
    expect(json.run.scenarioIds).toEqual(scenarioIds);
  });

  it("treats an empty scenarioIds array as unspecified (no filter)", () => {
    // A CLI path that omits --scenario passes an empty array to runQaSuite.
    // The summary must encode that as null so downstream parity/report
    // tooling doesn't interpret a full run as an explicit empty selection.
    const json = buildQaSuiteSummaryJson({
      ...baseParams,
      scenarioIds: [],
    });
    expect(json.run.scenarioIds).toBeNull();
  });

  it("records an Anthropic baseline lane cleanly for parity runs", () => {
    const json = buildQaSuiteSummaryJson({
      ...baseParams,
      primaryModel: "anthropic/claude-opus-4-6",
      alternateModel: "anthropic/claude-sonnet-4-6",
    });
    expect(json.run).toMatchObject({
      primaryModel: "anthropic/claude-opus-4-6",
      primaryProvider: "anthropic",
      primaryModelName: "claude-opus-4-6",
      alternateModel: "anthropic/claude-sonnet-4-6",
      alternateProvider: "anthropic",
      alternateModelName: "claude-sonnet-4-6",
    });
  });

  it("leaves split fields null when a model ref is malformed", () => {
    const json = buildQaSuiteSummaryJson({
      ...baseParams,
      primaryModel: "not-a-real-ref",
      alternateModel: "",
    });
    expect(json.run).toMatchObject({
      primaryModel: "not-a-real-ref",
      primaryProvider: null,
      primaryModelName: null,
      alternateModel: "",
      alternateProvider: null,
      alternateModelName: null,
    });
  });

  it("keeps scenarios and counts alongside the run metadata", () => {
    const json = buildQaSuiteSummaryJson(baseParams);
    expect(json.scenarios).toHaveLength(2);
    expect(json.counts).toEqual({
      total: 2,
      passed: 1,
      failed: 1,
    });
  });
});
@@ -81,7 +81,7 @@ type QaSuiteStep = {
  run: () => Promise<string | void>;
};

-type QaSuiteScenarioResult = {
+export type QaSuiteScenarioResult = {
  name: string;
  status: "pass" | "fail";
  steps: QaReportCheck[];
@@ -1365,17 +1365,105 @@ function createQaSuiteReportNotes(params: {
  return params.transport.createReportNotes(params);
}

export type QaSuiteSummaryJsonParams = {
  scenarios: QaSuiteScenarioResult[];
  startedAt: Date;
  finishedAt: Date;
  providerMode: QaProviderMode;
  primaryModel: string;
  alternateModel: string;
  fastMode: boolean;
  concurrency: number;
  scenarioIds?: readonly string[];
};

/**
 * Strongly-typed shape of `qa-suite-summary.json`. The GPT-5.4 parity gate
 * (agentic-parity-report.ts, #64441) and any future parity wrapper can
 * import this type instead of re-declaring the shape, so changes to the
 * summary schema propagate through to every consumer at type-check time.
 */
export type QaSuiteSummaryJson = {
  scenarios: QaSuiteScenarioResult[];
  counts: {
    total: number;
    passed: number;
    failed: number;
  };
  run: {
    startedAt: string;
    finishedAt: string;
    providerMode: QaProviderMode;
    primaryModel: string;
    primaryProvider: string | null;
    primaryModelName: string | null;
    alternateModel: string;
    alternateProvider: string | null;
    alternateModelName: string | null;
    fastMode: boolean;
    concurrency: number;
    scenarioIds: string[] | null;
  };
};

/**
 * Pure-ish JSON builder for qa-suite-summary.json. Exported so the GPT-5.4
 * parity gate (agentic-parity-report.ts, #64441) and any future parity
 * runner can assert-and-trust the provider/model that produced a given
 * summary instead of blindly accepting the caller's candidateLabel /
 * baselineLabel. Without the `run` block, a maintainer who swaps candidate
 * and baseline summary paths could silently produce a mislabeled verdict.
 *
 * `scenarioIds` is only recorded when the caller passed a non-empty array
 * (an explicit scenario selection). A missing or empty array means "no
 * filter, full lane-selected catalog", which the summary encodes as `null`
 * so parity/report tooling doesn't mistake a full run for an explicit
 * empty selection.
 */
export function buildQaSuiteSummaryJson(params: QaSuiteSummaryJsonParams): QaSuiteSummaryJson {
  const primarySplit = splitModelRef(params.primaryModel);
  const alternateSplit = splitModelRef(params.alternateModel);
  return {
    scenarios: params.scenarios,
    counts: {
      total: params.scenarios.length,
      passed: params.scenarios.filter((scenario) => scenario.status === "pass").length,
      failed: params.scenarios.filter((scenario) => scenario.status === "fail").length,
    },
    run: {
      startedAt: params.startedAt.toISOString(),
      finishedAt: params.finishedAt.toISOString(),
      providerMode: params.providerMode,
      primaryModel: params.primaryModel,
      primaryProvider: primarySplit?.provider ?? null,
      primaryModelName: primarySplit?.model ?? null,
      alternateModel: params.alternateModel,
      alternateProvider: alternateSplit?.provider ?? null,
      alternateModelName: alternateSplit?.model ?? null,
      fastMode: params.fastMode,
      concurrency: params.concurrency,
      scenarioIds:
        params.scenarioIds && params.scenarioIds.length > 0 ? [...params.scenarioIds] : null,
    },
  };
}
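Two normalization rules do the heavy lifting in the builder: the provider/model split (malformed refs yield `null` fields) and the "empty selection means no filter" encoding of `scenarioIds`. `splitModelRef` itself is not part of this diff, so the version below is a hypothetical stand-in that only assumes the `provider/model` convention implied by the tests:

```typescript
// Stand-in for splitModelRef (not shown in this diff): split on the
// first "/", and treat a missing provider or model half as malformed.
function splitModelRefSketch(ref: string): { provider: string; model: string } | null {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) return null;
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}

// The scenarioIds rule from buildQaSuiteSummaryJson: undefined and []
// both mean "no filter", which the summary encodes as null.
function normalizeScenarioIds(ids?: readonly string[]): string[] | null {
  return ids && ids.length > 0 ? [...ids] : null;
}

console.log(splitModelRefSketch("anthropic/claude-opus-4-6")); // { provider: 'anthropic', model: 'claude-opus-4-6' }
console.log(splitModelRefSketch("not-a-real-ref"));            // null
console.log(normalizeScenarioIds([]));                         // null
```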

async function writeQaSuiteArtifacts(params: {
  outputDir: string;
  startedAt: Date;
  finishedAt: Date;
  scenarios: QaSuiteScenarioResult[];
  transport: QaTransportAdapter;
- providerMode: "mock-openai" | "live-frontier";
+ // Reuse the canonical QaProviderMode union instead of re-declaring it
+ // inline. Loop 6 already unified `QaSuiteSummaryJsonParams.providerMode`
+ // on this type; keeping the writer in sync prevents drift when model-
+ // selection.ts adds a new provider mode.
+ providerMode: QaProviderMode;
  primaryModel: string;
  alternateModel: string;
  fastMode: boolean;
  concurrency: number;
  scenarioIds?: readonly string[];
}) {
  const report = renderQaMarkdownReport({
    title: "OpenClaw QA Scenario Suite",
@@ -1395,18 +1483,7 @@ async function writeQaSuiteArtifacts(params: {
  await fs.writeFile(reportPath, report, "utf8");
  await fs.writeFile(
    summaryPath,
-   `${JSON.stringify(
-     {
-       scenarios: params.scenarios,
-       counts: {
-         total: params.scenarios.length,
-         passed: params.scenarios.filter((scenario) => scenario.status === "pass").length,
-         failed: params.scenarios.filter((scenario) => scenario.status === "fail").length,
-       },
-     },
-     null,
-     2,
-   )}\n`,
+   `${JSON.stringify(buildQaSuiteSummaryJson(params), null, 2)}\n`,
    "utf8",
  );
  return { report, reportPath, summaryPath };
@@ -1576,6 +1653,16 @@ export async function runQaSuite(params?: QaSuiteRunParams): Promise<QaSuiteResu
    alternateModel,
    fastMode,
    concurrency,
    // When the caller supplied an explicit non-empty --scenario filter,
    // record the executed (post-selectQaSuiteScenarios-normalized) ids
    // so the summary matches what actually ran. When the caller passed
    // nothing or an empty array ("no filter, full lane catalog"),
    // preserve the unfiltered = null semantic so the summary stays
    // distinguishable from an explicit all-scenarios selection.
    scenarioIds:
      params?.scenarioIds && params.scenarioIds.length > 0
        ? selectedCatalogScenarios.map((scenario) => scenario.id)
        : undefined,
  });
  lab.setLatestReport({
    outputPath: reportPath,
@@ -1737,6 +1824,12 @@ export async function runQaSuite(params?: QaSuiteRunParams): Promise<QaSuiteResu
    alternateModel,
    fastMode,
    concurrency,
    // Same "filtered → executed list, unfiltered → null" convention as
    // the concurrent-path writeQaSuiteArtifacts call above.
    scenarioIds:
      params?.scenarioIds && params.scenarioIds.length > 0
        ? selectedCatalogScenarios.map((scenario) => scenario.id)
        : undefined,
  });
  const latestReport = {
    outputPath: reportPath,

@@ -151,6 +151,20 @@ steps:
        ref: imageStartedAtMs
      timeoutMs:
        expr: liveTurnTimeoutMs(env, 45000)
    # Tool-call assertion (criterion 2 of the parity completion
    # gate in #64227): the restored `image_generate` capability
    # must have actually fired as a real tool call. Without this
    # assertion, a prose reply that just mentions a MEDIA path
    # could satisfy the scenario, so strengthen it by requiring
    # the mock to have recorded `plannedToolName: "image_generate"`
    # against a post-restart request. The `!env.mock || ...`
    # guard means this check only runs in mock mode (where
    # `/debug/requests` is available); live-frontier runs skip
    # it and still pass the rest of the scenario.
    - assert:
        expr: "!env.mock || [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].some((request) => String(request.allInputText ?? '').toLowerCase().includes('capability flip image check') && request.plannedToolName === 'image_generate')"
        message:
          expr: "`expected image_generate tool call during capability flip scenario, saw plannedToolNames=${JSON.stringify([...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => String(request.allInputText ?? '').toLowerCase().includes('capability flip image check')).map((request) => request.plannedToolName ?? null))}`"
  finally:
    - call: patchConfig
      args:

@@ -64,9 +64,26 @@ steps:
        expr: "!missingColorGroup"
      message:
        expr: "`missing expected colors in image description: ${outbound.text}`"
    # Image-processing assertion: verify the mock actually received an
    # image on the scenario-unique prompt. This is as strong as a
    # tool-call assertion for this scenario — unlike the
    # `source-docs-discovery-report` / `subagent-handoff` /
    # `config-restart-capability-flip` scenarios that rely on a real
    # tool call to satisfy the parity criterion, image understanding
    # is handled inside the provider's vision capability and does NOT
    # emit a tool call the mock can record as `plannedToolName`. The
    # `imageInputCount` field IS the tool-call evidence for vision
    # scenarios: it proves the attachment reached the provider, which
    # is the only thing an external harness can verify in mock mode.
    # Match on the scenario-unique prompt substring so the assertion
    # can't be accidentally satisfied by some other scenario's image
    # request that happens to share a debug log with this one.
    - set: imageRequest
      value:
        expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].find((request) => String(request.prompt ?? '').includes('Image understanding check')) : null"
    - assert:
-       expr: "!env.mock || (((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.prompt ?? '').includes('Image understanding check'))?.imageInputCount ?? 0) >= 1)"
+       expr: "!env.mock || (imageRequest && (imageRequest.imageInputCount ?? 0) >= 1)"
        message:
-         expr: "`expected at least one input image, got ${String((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.prompt ?? '').includes('Image understanding check'))?.imageInputCount ?? 0)}`"
+         expr: "`expected at least one input image on the Image understanding check request, got imageInputCount=${String(imageRequest?.imageInputCount ?? 0)}`"
      detailsExpr: outbound.text
```

127
qa/scenarios/instruction-followthrough-repo-contract.md
Normal file
127
qa/scenarios/instruction-followthrough-repo-contract.md
Normal file
@@ -0,0 +1,127 @@
# Instruction followthrough repo contract

```yaml qa-scenario
id: instruction-followthrough-repo-contract
title: Instruction followthrough repo contract
surface: repo-contract
objective: Verify the agent reads repo instruction files first, follows the required tool order, and completes the first feasible action instead of stopping at a plan.
successCriteria:
- Agent reads the seeded instruction files before writing the requested artifact.
- Agent writes the requested artifact in the same run instead of returning only a plan.
- Agent does not ask for permission before the first feasible action.
- Final reply makes the completed read/write sequence explicit.
docsRefs:
- docs/help/testing.md
- docs/channels/qa-channel.md
codeRefs:
- src/agents/system-prompt.ts
- src/agents/pi-embedded-runner/run/incomplete-turn.ts
- extensions/qa-lab/src/mock-openai-server.ts
execution:
kind: flow
summary: Verify the agent reads repo instructions first, then completes the first bounded followthrough task without stalling.
config:
workspaceFiles:
AGENT.md: |-
# Repo contract

Step order:
1. Read AGENT.md.
2. Read SOUL.md.
3. Read FOLLOWTHROUGH_INPUT.md.
4. Write ./repo-contract-summary.txt.
5. Reply with three labeled lines exactly once: Read, Wrote, Status.

Do not stop after planning.
Do not ask for permission before the first feasible action.
SOUL.md: |-
# Execution style

Stay brief, honest, and action-first.
If the next tool action is feasible, do it before replying.
FOLLOWTHROUGH_INPUT.md: |-
Mission: prove you followed the repo contract.
Evidence path: AGENT.md -> SOUL.md -> FOLLOWTHROUGH_INPUT.md -> repo-contract-summary.txt
prompt: |-
Repo contract followthrough check. Read AGENT.md, SOUL.md, and FOLLOWTHROUGH_INPUT.md first.
Then follow the repo contract exactly, write ./repo-contract-summary.txt, and reply with
three labeled lines: Read, Wrote, Status.
Do not stop after planning and do not ask for permission before the first feasible action.
expectedReplyAll:
- "read:"
- "wrote:"
- "status:"
forbiddenNeedles:
- need permission
- need your approval
- can you approve
- i would
- i can
- next i would
```

```yaml qa-flow
steps:
- name: follows repo instructions instead of stopping at a plan
actions:
- call: reset
- forEach:
items:
expr: "Object.entries(config.workspaceFiles ?? {})"
item: workspaceFile
actions:
- call: fs.writeFile
args:
- expr: "path.join(env.gateway.workspaceDir, String(workspaceFile[0]))"
- expr: "`${String(workspaceFile[1] ?? '').trimEnd()}\\n`"
- utf8
- set: artifactPath
value:
expr: "path.join(env.gateway.workspaceDir, 'repo-contract-summary.txt')"
- call: runAgentPrompt
args:
- ref: env
- sessionKey: agent:qa:repo-contract
message:
expr: config.prompt
timeoutMs:
expr: liveTurnTimeoutMs(env, 40000)
- call: waitForCondition
saveAs: artifact
args:
- lambda:
async: true
expr: "((await fs.readFile(artifactPath, 'utf8').catch(() => null))?.includes('Mission: prove you followed the repo contract.') ? await fs.readFile(artifactPath, 'utf8').catch(() => null) : undefined)"
- expr: liveTurnTimeoutMs(env, 30000)
- expr: "env.providerMode === 'mock-openai' ? 100 : 250"
- set: expectedReplyAll
value:
expr: config.expectedReplyAll.map(normalizeLowercaseStringOrEmpty)
- call: waitForCondition
saveAs: outbound
args:
- lambda:
expr: "state.getSnapshot().messages.filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && expectedReplyAll.every((needle) => normalizeLowercaseStringOrEmpty(candidate.text).includes(needle))).at(-1)"
- expr: liveTurnTimeoutMs(env, 30000)
- expr: "env.providerMode === 'mock-openai' ? 100 : 250"
- assert:
expr: "!config.forbiddenNeedles.some((needle) => normalizeLowercaseStringOrEmpty(outbound.text).includes(needle))"
message:
expr: "`repo contract followthrough bounced for permission or stalled: ${outbound.text}`"
- set: followthroughDebugRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => /repo contract followthrough check/i.test(String(request.allInputText ?? ''))) : []"
- assert:
expr: "!env.mock || followthroughDebugRequests.filter((request) => request.plannedToolName === 'read').length >= 3"
message:
expr: "`expected three read tool calls before write, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
- assert:
expr: "!env.mock || followthroughDebugRequests.some((request) => request.plannedToolName === 'write')"
message:
expr: "`expected write tool call during repo contract followthrough, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
- assert:
expr: "!env.mock || (() => { const readIndices = followthroughDebugRequests.map((r, i) => r.plannedToolName === 'read' ? i : -1).filter(i => i >= 0); const firstWrite = followthroughDebugRequests.findIndex((r) => r.plannedToolName === 'write'); return readIndices.length >= 3 && firstWrite >= 0 && readIndices[2] < firstWrite; })()"
message:
expr: "`expected all 3 reads before any write during repo contract followthrough, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
detailsExpr: outbound.text
```
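The ordering IIFE in the final assert can be read as a standalone helper. A minimal sketch, under the same assumption the flow makes that each debug-log entry carries a `plannedToolName` field:

```javascript
// True only when at least `minReads` 'read' requests all precede the
// first 'write' request in the log; false when the write is missing or
// when any of the first `minReads` reads lands after it.
function readsPrecedeWrite(requests, minReads = 3) {
  const readIndices = requests
    .map((request, index) => (request.plannedToolName === "read" ? index : -1))
    .filter((index) => index >= 0);
  const firstWrite = requests.findIndex(
    (request) => request.plannedToolName === "write"
  );
  return (
    readIndices.length >= minReads &&
    firstWrite >= 0 &&
    readIndices[minReads - 1] < firstWrite
  );
}
```

Comparing the index of the `minReads`-th read against the first write is what makes a read-read-write-read log fail: the write may not interleave before the required reads are done.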
@@ -1,5 +1,36 @@
# Memory recall after context switch

<!--
This scenario deliberately stays prose-only and does NOT gate on a
`/debug/requests` tool-call assertion, even though it is one of the
scenarios in the parity pack. The adversarial review in the umbrella
#64227 thread called this out as a coverage gap, but the underlying
behavior the scenario tests is legitimately prose-shaped: the agent is
supposed to pull a prior-turn fact ("ALPHA-7") back across an
intervening context switch and reply with the code. In a real
conversation, the model can do this EITHER by calling a memory-search
tool (which the qa-lab mock server doesn't currently expose) OR by
reading the fact directly from prior-turn context in its own
conversation window. Both strategies are valid parity behavior.

Forcing a `plannedToolName` assertion here would either require
extending the mock with a synthetic `memory_search` tool lane (PR O
scope, not PR J) or fabricating a tool-call requirement the real
providers never implement. Either path would make this scenario test
the harness, not the models. So we keep it prose-only, covered by the
`recallExpectedAny` / `rememberAckAny` assertions above, and flag the
exception explicitly rather than silently.

Criterion 2 of the parity completion gate (no fake progress or fake
tool completion) is enforced for this scenario through the parity
report's failure-tone fake-success detector: a scenario marked `pass`
whose details text matches patterns like "timed out", "failed to",
"could not" gets flagged via `SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS`
in `extensions/qa-lab/src/agentic-parity-report.ts`. Positive-tone
detection was removed because it false-positives on legitimate passes
where the details field is the model's outbound prose.
-->
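The failure-tone detector described above can be sketched as a predicate. The exact pattern list here is an assumption for illustration (the real one lives in `extensions/qa-lab/src/agentic-parity-report.ts`), built from the three example phrases the comment names:

```javascript
// Assumed pattern list: the real SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS
// may contain more entries than the three phrases named above.
const SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS = [
  /timed out/i,
  /failed to/i,
  /could not/i,
];

// A scenario is a suspicious pass when it is marked 'pass' but its
// details text reads like a failure narration. Non-pass statuses are
// never flagged; only fake successes are of interest here.
function isSuspiciousPass(status, detailsText) {
  return (
    status === "pass" &&
    SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS.some((pattern) =>
      pattern.test(String(detailsText ?? ""))
    )
  );
}
```

Note the asymmetry the comment justifies: only failure-tone text on a `pass` is flagged; positive-tone text is left alone because the details field is often the model's own outbound prose.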

```yaml qa-scenario
id: memory-recall
title: Memory recall after context switch

@@ -69,13 +69,22 @@ steps:
expr: hasModelSwitchContinuityEvidence(outbound.text)
message:
expr: "`switch reply missed kickoff continuity: ${outbound.text}`"
- if:
expr: "Boolean(env.mock)"
then:
- set: switchDebugRequests
value:
expr: "await fetchJson(`${env.mock.baseUrl}/debug/requests`)"
- set: switchRequest
value:
expr: "switchDebugRequests.find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))"
- assert:
expr: "switchRequest?.plannedToolName === 'read'"
message:
expr: "`expected read after switch, got ${String(switchRequest?.plannedToolName ?? '')}`"
- assert:
expr: "String(switchRequest?.model ?? '') === String(alternate?.model ?? '')"
message:
expr: "`expected alternate model, got ${String(switchRequest?.model ?? '')}`"
detailsExpr: outbound.text
```

@@ -56,5 +56,20 @@ steps:
expr: "!reportsDiscoveryScopeLeak(outbound.text)"
message:
expr: "`discovery report drifted beyond scope: ${outbound.text}`"
# Parity gate criterion 2 (no fake progress / fake tool completion):
# require an actual read tool call before the prose report. Without this,
# a model could fabricate a plausible Worked/Failed/Blocked/Follow-up
# report without ever touching the repo files the prompt names. The
# debug request log is fetched once and reused for both the assertion
# and its failure-message diagnostic. Each request's allInputText is
# lowercased inline at match time (the real prompt writes it as
# "Worked, Failed, Blocked") so the contains check is case-insensitive.
- set: discoveryDebugRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))] : []"
- assert:
expr: "!env.mock || discoveryDebugRequests.some((request) => String(request.allInputText ?? '').toLowerCase().includes('worked, failed, blocked') && request.plannedToolName === 'read')"
message:
expr: "`expected at least one read tool call during discovery report scenario, saw plannedToolNames=${JSON.stringify(discoveryDebugRequests.map((request) => request.plannedToolName ?? null))}`"
detailsExpr: outbound.text
```
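The case-insensitive contains check above reduces to a small predicate. A minimal sketch, assuming log entries shaped like `{ allInputText, plannedToolName }` per the expressions above; the sample entries are invented:

```javascript
// True when some request both contains the scenario's prompt needle
// (compared case-insensitively) and actually planned a 'read' tool call.
// A prose-only fabricated report leaves plannedToolName unset on every
// matching request, so it cannot satisfy this predicate.
function hasRealReadCall(requests, promptNeedle) {
  const needle = promptNeedle.toLowerCase();
  return requests.some(
    (request) =>
      String(request.allInputText ?? "").toLowerCase().includes(needle) &&
      request.plannedToolName === "read"
  );
}
```

Lowercasing both sides at match time is what lets the YAML assertion use the lowercase needle `'worked, failed, blocked'` against the prompt's mixed-case "Worked, Failed, Blocked".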

@@ -113,6 +113,28 @@ steps:
expr: "sawAlpha && sawBeta"
message:
expr: "`fanout child sessions missing (alpha=${String(sawAlpha)} beta=${String(sawBeta)})`"
# Tool-call assertion (criterion 2 of the parity completion gate in
# #64227): the scenario must have actually invoked `sessions_spawn` at
# least twice, not just ended up with two rows in the session store
# through prose trickery. The session store alone can be populated by
# other flows or by a model that fabricates "delegation" narration.
# `plannedToolName` on the mock's `/debug/requests` log is the
# tool-call ground truth: two recorded sessions_spawn requests mean
# the model really dispatched both subagents, and the sawAlpha/sawBeta
# check above confirms the two children carried distinct labels.
- set: fanoutSpawnRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => request.plannedToolName === 'sessions_spawn' && /subagent fanout synthesis check/i.test(String(request.allInputText ?? ''))) : []"
- assert:
expr: "!env.mock || fanoutSpawnRequests.length >= 2"
message:
expr: "`expected at least two sessions_spawn tool calls during subagent fanout scenario, saw ${fanoutSpawnRequests.length}`"
- set: details
value:
expr: "outbound.text"

@@ -46,5 +46,25 @@ steps:
expr: "!['failed to delegate','could not delegate','subagent unavailable'].some((needle) => normalizeLowercaseStringOrEmpty(outbound.text).includes(needle))"
message:
expr: "`subagent handoff reported failure: ${outbound.text}`"
# Parity gate criterion 2 (no fake progress / fake tool completion):
# require an actual sessions_spawn tool call. Without this, a model
# could produce the three labeled sections ("Delegated task", "Result",
# "Evidence") as free-form prose without ever delegating to a real
# subagent. The assertion is pinned to THIS scenario by matching the
# scenario-unique prompt substring "Delegate one bounded QA task"
# (not a broad /delegate|subagent/ regex) so the earlier
# subagent-fanout-synthesis scenario — which also contains "delegate"
# and produces its own pre-tool sessions_spawn request — cannot
# satisfy the assertion here. The match is also constrained to
# pre-tool requests (no toolOutput) because the mock only plans
# sessions_spawn on requests with no toolOutput; the follow-up
# request after the tool runs has plannedToolName unset.
- set: subagentDebugRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))] : []"
- assert:
expr: "!env.mock || subagentDebugRequests.some((request) => !request.toolOutput && /delegate one bounded qa task/i.test(String(request.allInputText ?? '')) && request.plannedToolName === 'sessions_spawn')"
message:
expr: "`expected sessions_spawn tool call during subagent handoff scenario, saw plannedToolNames=${JSON.stringify(subagentDebugRequests.map((request) => request.plannedToolName ?? null))}`"
detailsExpr: outbound.text
```
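The pre-tool constraint the comment describes can be sketched as a predicate. A minimal sketch, assuming log entries shaped like `{ allInputText, toolOutput, plannedToolName }` per the expressions above; the sample entries are invented:

```javascript
// True when some pre-tool request (no toolOutput yet) carries this
// scenario's unique prompt substring AND actually planned sessions_spawn.
// The follow-up request after the tool runs carries toolOutput, so the
// !request.toolOutput guard excludes it even if it echoed the prompt.
function sawRealSpawn(requests) {
  return requests.some(
    (request) =>
      !request.toolOutput &&
      /delegate one bounded qa task/i.test(String(request.allInputText ?? "")) &&
      request.plannedToolName === "sessions_spawn"
  );
}
```

Pinning on the scenario-unique "Delegate one bounded QA task" substring, rather than a broad delegate/subagent regex, is what keeps the fanout scenario's spawn requests from leaking into this check.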

@@ -1 +1 @@
b92daceecab88cdb1ceeab30a7321399850a1fd13773af22dbb2035d39cdd5f8
1d087c0991987824d78c8ac4ec2c0e66d661f4bd4afd12b193d66634c69d75a0