mirror of
https://github.com/openclaw/openclaw.git
synced 2026-04-18 15:23:23 +02:00
qa: salvage GPT-5.4 parity proof slice (#65664)
* test(qa): gate parity prose scenarios on real tool calls

Closes criterion 2 of the GPT-5.4 parity completion gate in #64227 ('no fake progress / fake tool completion') for the two first/second-wave parity scenarios that can currently pass with a prose-only reply.

Background: the scenario framework already exposes tool-call assertions via /debug/requests on the mock server (see approval-turn-tool-followthrough for the pattern). Most parity scenarios use this seam to require a specific plannedToolName, but source-docs-discovery-report and subagent-handoff only checked the assistant's prose text, which means a model could fabricate:

- a Worked / Failed / Blocked / Follow-up report without ever calling the read tool on the docs / source files the prompt named
- three labeled 'Delegated task', 'Result', 'Evidence' sections without ever calling sessions_spawn to delegate

Both gaps are fake-progress loopholes for the parity gate.

Changes:

- source-docs-discovery-report: require at least one read tool call tied to the 'worked, failed, blocked' prompt in /debug/requests. The failure message dumps the observed plannedToolName list for debugging.
- subagent-handoff: require at least one sessions_spawn tool call tied to the 'delegate' / 'subagent handoff' prompt in /debug/requests, with the same debug-friendly failure message.

Both assertions are gated behind !env.mock so they no-op in live-frontier mode, where the real provider exposes plannedToolName through a different channel (or not at all).

Not touched: memory-recall is also in the parity pack, but its pass path is legitimately 'read the fact from prior-turn context'. That is a valid recall strategy, not fake progress, so it is out of scope for this PR. memory-recall's fake-progress story (no real memory_search call) would require bigger mock-server changes and belongs in a follow-up that extends the mock memory pipeline.
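As a sketch, the mock-mode discovery assertion looks roughly like the following (the flat `DebugRequest` shape and the helper name are illustrative assumptions, not the real scenario-framework API):

```typescript
// Simplified shape of one /debug/requests snapshot entry; the real mock
// server records more fields. Both field names are assumptions here.
interface DebugRequest {
  plannedToolName?: string;
  allInputText: string;
}

// Mock-mode assertion sketch: at least one request tied to the discovery
// prompt must have planned the read tool; otherwise fail with the observed
// plannedToolName list so the loophole is debuggable.
function assertReadToolCalled(requests: DebugRequest[]): void {
  const matched = requests.some(
    (request) =>
      request.allInputText.toLowerCase().includes("worked, failed, blocked") &&
      request.plannedToolName === "read",
  );
  if (!matched) {
    const observed = requests.map((r) => r.plannedToolName ?? "(none)").join(", ");
    throw new Error(`expected a read tool call for the discovery prompt; observed: ${observed}`);
  }
}
```

In live-frontier mode the scenario would skip this check entirely, mirroring the mock-only gating described above.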
Validation:

- pnpm test extensions/qa-lab/src/scenario-catalog.test.ts

Refs #64227

* test(qa): fix case-sensitive tool-call assertions and dedupe debug fetch

Addresses loop-6 review feedback on PR #64681:

1. Copilot / Greptile / codex-connector all flagged that the discovery scenario's .includes('worked, failed, blocked') assertion is case-sensitive, but the real prompt says 'Worked, Failed, Blocked...', so the mock-mode assertion never matches. Fix: lowercase-normalize allInputText before the contains check.
2. Greptile P2: the expr and message.expr each called fetchJson separately, incurring two round-trips to /debug/requests. Fix: hoist the fetch to a set step (discoveryDebugRequests / subagentDebugRequests) and reuse the snapshot.
3. Copilot: the subagent-handoff assertion scanned the entire request log and matched the first request with 'delegate' in its input text, which could false-pass on a stale prior scenario. Fix: reverse the array and take the most recent matching request instead.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass).

Refs #64227

* test(qa): narrow subagent-handoff tool-call assertion to pre-tool requests

Pass-2 codex-connector P1 finding on #64681: the reverse-find pattern from pass 1 usually lands on the FOLLOW-UP request after the mock runs sessions_spawn, not the pre-tool planning request that actually has plannedToolName === 'sessions_spawn'. The mock only plans that tool on requests with !toolOutput (mock-openai-server.ts:662), so the post-tool request has plannedToolName unset and the assertion fails even when the handoff succeeded.

Fix: switch the assertion back to a forward .some() match but add a !request.toolOutput filter so the match is pinned to the pre-tool planning phase. The case-insensitive regex, the fetchJson dedupe, and the failure-message diagnostic from pass 1 are unchanged.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass).
Refs #64227

* test(qa): pin subagent-handoff tool-call assertion to scenario prompt

Addresses the pass-3 codex-connector P1 on #64681: the pass-2 fix filtered to pre-tool requests but still used a broad `/delegate|subagent handoff/i` regex. The `subagent-fanout-synthesis` scenario runs BEFORE `subagent-handoff` in catalog order (scenarios are sorted by path), and the fanout prompt reads 'Subagent fanout synthesis check: delegate exactly two bounded subagents sequentially', which contains 'delegate' and also plans sessions_spawn pre-tool. That produces a cross-scenario false pass where the fanout's earlier sessions_spawn request satisfies the handoff assertion even when the handoff run never delegates.

Fix: tighten the input-text match from `/delegate|subagent handoff/i` to `/delegate one bounded qa task/i`, which is the exact scenario-unique substring from the `subagent-handoff` config.prompt. That pins the assertion to this scenario's request window and closes the cross-scenario false positive.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass).

Refs #64227

* test(qa): align parity assertion comments with actual filter logic

Addresses two loop-7 Copilot findings on PR #64681:

1. source-docs-discovery-report.md: the explanatory comment said the debug request log was 'lowercased for case-insensitive matching', but the code actually lowercases each request's allInputText inline inside the .some() predicate, not the discoveryDebugRequests snapshot. Rewrite the comment to describe the inline-lowercase pattern so a future reader matches the code they see.
2. subagent-handoff.md: the comment said the assertion 'must be pinned to THIS scenario's request window', but the implementation actually relies on matching a scenario-unique prompt substring (/delegate one bounded qa task/i), not a request window. Rewrite the comment to describe the substring pinning and keep the pre-tool filter rationale intact.
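Put together, the pass-2 plus pass-3 assertion predicate can be sketched like this (the request shape and helper name are illustrative; only the regex literal and the `sessions_spawn` / `!toolOutput` conditions come from the changes above):

```typescript
// Illustrative /debug/requests entry; toolOutput is set only on follow-up
// requests that carry a tool result back to the model.
interface DebugRequest {
  plannedToolName?: string;
  allInputText: string;
  toolOutput?: string;
}

// Scenario-unique substring from the subagent-handoff prompt: pins the match
// to this scenario's requests so the earlier fanout scenario cannot satisfy it.
const HANDOFF_PROMPT = /delegate one bounded qa task/i;

function subagentHandoffDelegated(requests: DebugRequest[]): boolean {
  return requests.some(
    (request) =>
      // Pre-tool planning phase only: the mock sets plannedToolName here.
      !request.toolOutput &&
      HANDOFF_PROMPT.test(request.allInputText) &&
      request.plannedToolName === "sessions_spawn",
  );
}
```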
No runtime change; a comment-only fix to keep reviewer expectations aligned with the actual assertion shape.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass).

Refs #64227

* test(qa): extend tool-call assertions to image-understanding, subagent-fanout, and capability-flip scenarios

* Guard mock-only image parity assertions

* Expand agentic parity second wave

* test(qa): pad parity suspicious-pass isolation to second wave

* qa-lab: parametrize parity report title and drop stale first-wave comment

Addresses two loop-7 Copilot findings on PR #64662:

1. Hard-coded 'GPT-5.4 / Opus 4.6' markdown H1: the renderer now uses a template string that interpolates candidateLabel and baselineLabel, so any parity run (not only gpt-5.4 vs opus 4.6) renders an accurate title in saved reports. Default CLI flags still produce openai/gpt-5.4 vs anthropic/claude-opus-4-6 as the baseline pair.
2. Stale 'declared first-wave parity scenarios' comment in scopeSummaryToParityPack: the parity pack is now the ten-scenario first-wave + second-wave set (PR D + PR E). Comment updated to drop the first-wave qualifier and name the full QA_AGENTIC_PARITY_SCENARIOS constant the scope is filtering against.

New regression: 'parametrizes the markdown header from the comparison labels' asserts that non-default labels (openai/gpt-5.4-alt vs openai/gpt-5.4) render in the H1.

Validation: pnpm test extensions/qa-lab/src/agentic-parity-report.test.ts (13/13 pass).
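The title parametrization can be sketched as a one-line template. The exact H1 wording here is an assumption; the point is only that both labels are interpolated rather than hard-coded:

```typescript
// Interpolate both run labels into the report H1 instead of hard-coding
// "GPT-5.4 / Opus 4.6"; any candidate/baseline pair renders accurately.
function renderParityReportTitle(candidateLabel: string, baselineLabel: string): string {
  return `# Agentic parity report: ${candidateLabel} vs ${baselineLabel}`;
}
```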
Refs #64227

* qa-lab: fail parity gate on required scenario failures regardless of baseline parity

* test(qa): update readable-report test to cover all 10 parity scenarios

* qa-lab: strengthen parity-report fake-success detector and verify run.primaryProvider labels

* Tighten parity label and scenario checks

* fix: tighten parity label provenance checks

* fix: scope parity tool-call metrics to tool lanes

* Fix parity report label and fake-success checks

* fix(qa): tighten parity report edge cases

* qa-lab: add Anthropic /v1/messages mock route for parity baseline

Closes the last local-runnability gap on criterion 5 of the GPT-5.4 parity completion gate in #64227 ('the parity gate shows GPT-5.4 matches or beats Opus 4.6 on the agreed metrics').

Background: the parity gate needs two comparable scenario runs, one against openai/gpt-5.4 and one against anthropic/claude-opus-4-6, so the aggregate metrics and verdict in PR D (#64441) can be computed. Today the qa-lab mock server only implements /v1/responses, so the baseline run against Claude Opus 4.6 requires a real Anthropic API key. That makes the gate impossible to prove end-to-end from a local worktree and means the CI story is always 'two real providers + quota + keys'.

This PR adds a /v1/messages Anthropic-compatible route to the existing mock OpenAI server.
The route is a thin adapter that:

- Parses Anthropic Messages API request shapes (system as a string or [{type: text, text}], messages with string or block content, and text, tool_result, tool_use, and image blocks)
- Translates them into the ResponsesInputItem[] shape the existing shared scenario dispatcher (buildResponsesPayload) already understands
- Calls the shared dispatcher so both the OpenAI and Anthropic lanes run through the exact same scenario prompt-matching logic (same subagent fanout state machine, same extractRememberedFact helper, same /debug/requests telemetry)
- Converts the resulting OpenAI-format events back into an Anthropic message response with text and tool_use content blocks and a correct stop_reason (tool_use vs end_turn)

Non-streaming only: the QA suite runner falls back to non-streaming mock mode, so real Anthropic SSE isn't necessary for the parity baseline.

Also adds claude-opus-4-6 and claude-sonnet-4-6 to /v1/models so baseline model-list probes from the suite runner resolve without extra config.

Tests added:

- advertises Anthropic claude-opus-4-6 baseline model on /v1/models
- dispatches an Anthropic /v1/messages read tool call for source discovery prompts (tool_use stop_reason, correct input path, /debug/requests records plannedToolName=read)
- dispatches Anthropic /v1/messages tool_result follow-ups through the shared scenario logic (subagent-handoff two-stage flow: tool_use, then tool_result, then a 'Delegated task / Evidence' prose summary)

Local validation:

- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (18/18 pass)
- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (47/47 pass)

Refs #64227

Unblocks #64441 (parity harness) and the forthcoming qa parity run wrapper by giving the baseline lane a local-only mock path.
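A minimal sketch of the translation step, assuming heavily simplified Anthropic and Responses shapes (the real adapter also handles system prompts, tool_use, tool_result, and image blocks, and the `ResponsesInputItem` shape here is an assumption):

```typescript
// Simplified Anthropic Messages API shapes: content is a string or text blocks.
type AnthropicContent = string | Array<{ type: "text"; text: string }>;
interface AnthropicMessage {
  role: "user" | "assistant";
  content: AnthropicContent;
}

// Assumed subset of the Responses-style input item the shared dispatcher eats.
interface ResponsesInputItem {
  type: "message";
  role: "user" | "assistant";
  content: Array<{ type: "input_text" | "output_text"; text: string }>;
}

function convertAnthropicMessages(messages: AnthropicMessage[]): ResponsesInputItem[] {
  return messages.map((message) => {
    // Normalize string content into a single text block first.
    const blocks =
      typeof message.content === "string"
        ? [{ type: "text" as const, text: message.content }]
        : message.content;
    return {
      type: "message",
      role: message.role,
      content: blocks.map((block) => ({
        // User turns become input_text; assistant turns become output_text.
        type: message.role === "user" ? ("input_text" as const) : ("output_text" as const),
        text: block.text,
      })),
    };
  });
}
```

With both lanes normalized into the same item list, a single dispatcher can serve OpenAI and Anthropic requests identically, which is the design choice the commit describes.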
* qa-lab: fix Anthropic tool_result ordering in messages adapter

Addresses the loop-6 Copilot / Greptile finding on PR #64685: in `convertAnthropicMessagesToResponsesInput`, `tool_result` blocks were pushed to `items` inside the per-block loop while the surrounding user/assistant message was only pushed after the loop finished. That reordered the function_call_output BEFORE its parent user message whenever a user turn mixed `tool_result` with fresh text/image blocks, which broke `extractToolOutput` (it scans AFTER the last user-role index; a function_call_output placed BEFORE that index is invisible to it) and made the downstream scenario dispatcher behave as if no tool output had been returned on mixed-content turns.

Fix: buffer `tool_result` and `tool_use` blocks in local arrays during the per-block loop, push the parent role message first (when it has any text/image pieces), then push the accumulated function_call / function_call_output items in their original order. tool_result-only user turns still omit the parent message as before, so the non-mixed subagent-fanout-synthesis two-stage flow that already worked keeps working.

Regression added:

- `places tool_result after the parent user message even in mixed-content turns`: sends a user turn that mixes a `tool_result` block with a trailing fresh text block, then inspects `/debug/last-request` to assert that `toolOutput === 'SUBAGENT-OK'` (extractToolOutput found the function_call_output AFTER the last user index) and `prompt === 'Keep going with the fanout.'` (extractLastUserText picked up the trailing fresh text).

Local validation: pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (19/19 pass).
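The buffering fix can be sketched on a stripped-down user turn (block and item shapes are illustrative; the real converter also carries tool_use, images, and call ids):

```typescript
// Minimal block and item shapes for one user turn.
interface Block {
  type: "text" | "tool_result";
  text?: string;
  output?: string;
}
type Item =
  | { type: "message"; role: "user"; text: string }
  | { type: "function_call_output"; output: string };

// Buffer tool_result blocks during the per-block loop and flush them AFTER
// the parent message, so a scanner that looks past the last user-role index
// (like extractToolOutput) still sees them on mixed-content turns.
function convertUserTurn(blocks: Block[]): Item[] {
  const items: Item[] = [];
  const toolOutputs: Item[] = [];
  const textPieces: string[] = [];
  for (const block of blocks) {
    if (block.type === "tool_result") {
      toolOutputs.push({ type: "function_call_output", output: block.output ?? "" });
    } else if (block.text) {
      textPieces.push(block.text);
    }
  }
  // Parent message first (only when the turn has fresh text), then the
  // buffered tool outputs in their original order. tool_result-only turns
  // still omit the parent message, preserving the pre-fix behavior.
  if (textPieces.length > 0) {
    items.push({ type: "message", role: "user", text: textPieces.join("\n") });
  }
  items.push(...toolOutputs);
  return items;
}
```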
Refs #64227

* qa-lab: reject Anthropic streaming and empty model in messages mock

* qa-lab: tag mock request snapshots with a provider variant so parity runs can diff per provider

* Handle invalid Anthropic mock JSON

* fix: wire mock parity providers by model ref

* fix(qa): support Anthropic message streaming in mock parity lane

* qa-lab: record provider/model/mode in qa-suite-summary.json

Closes the 'summary cannot be label-verified' half of criterion 5 on the GPT-5.4 parity completion gate in #64227.

Background: the parity gate in #64441 compares two qa-suite-summary.json files and trusts whatever candidateLabel / baselineLabel the caller passes. Today the summary JSON only contains { scenarios, counts }, so nothing in the summary records which provider/model the run actually used. If a maintainer swaps the candidate and baseline summary paths in a parity-report call, the verdict is silently mislabeled and nobody can retroactively verify which run produced which summary.

Changes:

- Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt, providerMode, primaryModel (+ provider and model splits), alternateModel (+ provider and model splits), fastMode, concurrency, and scenarioIds (when explicitly filtered).
- Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary JSON shape is unit-testable and the parity gate (and any future parity wrapper) can import the exact same type rather than reverse-engineering the JSON shape at runtime.
- Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so --scenario-ids flags are recorded in the summary.
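The new run block and the model-ref split can be sketched as follows. The field list comes from the changes above; exact optionality and the `splitModelRef` helper are assumptions:

```typescript
// Assumed shape of the summary's `run` block; the real exported type lives
// next to buildQaSuiteSummaryJson.
interface QaSuiteRunMetadata {
  startedAt: string;
  finishedAt: string;
  providerMode: "mock-openai" | "live-frontier";
  primaryModel: string;
  primaryProvider: string | null;
  alternateModel: string | null;
  fastMode: boolean;
  concurrency: number;
  scenarioIds: string[] | null;
}

// Split a "provider/model" ref into its halves; a malformed ref (no slash,
// or an empty half) leaves both splits null rather than guessing.
function splitModelRef(ref: string): { provider: string | null; model: string | null } {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) {
    return { provider: null, model: null };
  }
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}
```

The null-on-malformed behavior matches the 'leaves split fields null when a model ref is malformed' test case listed below it.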
Unit tests added (src/suite.summary-json.test.ts, 5 cases):

- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata

This is additive: existing callers of qa-suite-summary.json continue to see the same { scenarios, counts } shape, just with an extra run field. No existing consumers of the JSON need to change. The follow-up 'qa parity run' CLI wrapper (run the parity pack twice against candidate + baseline, emit two labeled summaries in one command) stacks cleanly on top of this change and will land as a separate PR once #64441 and #64662 merge, so the wrapper can call runQaParityReportCommand directly.

Local validation:

- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)

Refs #64227

Unblocks the final parity run for #64441 / #64662 by making summaries self-describing.

* qa-lab: strengthen qa-suite-summary builder types and empty-array semantics

Addresses 4 loop-6 Copilot / codex-connector findings on PR #64689 (re-opened as #64789):

1. P2 codex + Copilot: an empty `scenarioIds` array was serialized as `[]` because of a truthiness check. The CLI passes an empty array when --scenario is omitted, so full-suite runs would incorrectly record an explicit empty selection. Fix: switch to a `length > 0` check so `[]` and undefined both encode as `null` in the summary run metadata.
2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate consumers, but its return type was `Record<string, unknown>`, which defeated the point of exporting it. Fix: introduce a concrete `QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and make the builder return it.
Downstream code (the parity gate and the parity run wrapper) can now import the type and keep consumers type-checked.

3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the `'mock-openai' | 'live-frontier'` string union even though `QaProviderMode` is already imported from model-selection.ts. Fix: reuse `QaProviderMode` so provider-mode additions flow through both types at once.
4. Copilot: test fixtures omitted `steps` from the fake scenario results, creating shape drift with the real suite scenario-result shape. Fix: pad the test fixtures with `steps: []` and tighten the scenarioIds assertion to read `json.run.scenarioIds` directly (the new concrete return type makes the type cast unnecessary).

New regression: `treats an empty scenarioIds array as unspecified (no filter)` passes `scenarioIds: []` and asserts the summary records `scenarioIds: null`.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).

Refs #64227

* qa-lab: record executed scenarioIds in summary run metadata

Addresses the pass-3 codex-connector P2 on #64789 (replacement of #64689): `run.scenarioIds` was copied from the raw `params.scenarioIds` caller input, but `runQaSuite` normalizes that input through `selectQaSuiteScenarios`, which dedupes via `Set` and reorders the selection to catalog order. When callers repeat --scenario ids or pass them in non-catalog order, the summary metadata drifted from the scenarios actually executed, which can make parity/report tooling treat equivalent runs as different or trust inaccurate provenance.

Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass `selectedCatalogScenarios.map(scenario => scenario.id)` instead of `params?.scenarioIds`, so the summary records the post-selection executed list.
This also covers the full-suite case automatically (the executed list is the full lane-filtered catalog), giving parity consumers a stable record of exactly which scenarios landed in the run regardless of how the caller phrased the request. buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2 semantics are preserved, so the public helper still treats an empty array as 'unspecified' for any future caller that legitimately passes one.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).

Refs #64227

* qa-lab: preserve null scenarioIds for unfiltered suite runs

Addresses the pass-4 codex-connector P2 on #64789: the pass-3 fix always passed `selectedCatalogScenarios.map(...)` to writeQaSuiteArtifacts, which made unfiltered full-suite runs indistinguishable from an explicit all-scenarios selection in the summary metadata. The 'unfiltered → null' semantic (documented in the buildQaSuiteSummaryJson JSDoc and exercised by the "treats an empty scenarioIds array as unspecified" regression) was lost.

Fix: both writeQaSuiteArtifacts call sites now condition on the caller's original `params.scenarioIds`. When the caller passed an explicit non-empty filter, record the post-selection executed list (pass-3 behavior, preserving Set-dedupe + catalog-order normalization). When the caller passed undefined or an empty array, pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's length check serializes null (pass-2 behavior, preserving unfiltered semantics).

This keeps both codex-connector findings satisfied simultaneously:

- an explicit --scenario filter records the deduped, catalog-ordered executed list, not the raw caller input
- an unfiltered full-suite run records null, not a full catalog dump that would shadow 'explicit all-scenarios' selections

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).
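Combining the pass-2 through pass-4 semantics, the scenarioIds rule can be sketched as one pure helper (names are illustrative, not the real call-site code):

```typescript
interface CatalogScenario {
  id: string;
}

// requested: the caller's raw --scenario input; selected: the post-selection
// executed list (deduped, catalog order) from scenario selection.
function summaryScenarioIds(
  requested: string[] | undefined,
  selected: CatalogScenario[],
): string[] | null {
  // undefined or [] means "no explicit filter": serialize null so an
  // unfiltered run stays distinguishable from an explicit all-scenarios pick.
  if (!requested || requested.length === 0) {
    return null;
  }
  // Explicit filter: record what actually executed, not the raw input.
  return selected.map((scenario) => scenario.id);
}
```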
Refs #64227

* qa-lab: reuse QaProviderMode in writeQaSuiteArtifacts param type

* qa-lab: stage mock auth profiles so the parity gate runs without real credentials

* fix(qa): clean up mock auth staging follow-ups

* ci: add parity-gate workflow that runs the GPT-5.4 vs Opus 4.6 gate end-to-end against the qa-lab mock

* ci: use supported parity gate runner label

* ci: watch gateway changes in parity gate

* docs: pin parity runbook alternate models

* fix(ci): watch qa-channel parity inputs

* qa: roll up parity proof closeout

* qa: harden mock parity review fixes

* qa-lab: fix review findings (comment wording, placeholder key, exported type, ordering assertion) and remove false-positive positive-tone detection

* qa: fix memory-recall scenario count, update criterion 2 comment, cache fetchJson in model-switch

* qa-lab: clean up positive-tone comment + fix stale test expectations

* qa: pin workflow Node version to 22.14.0 + fix stale label-match wording

* qa-lab: refresh mock provider routing expectation

* docs: drop stale parity rollup rewrite from proof slice

* qa: run parity gate against mock lane

* deps: sync qa-lab lockfile

* build: refresh a2ui bundle hash

* ci: widen parity gate triggers

---------

Co-authored-by: Eva <eva@100yen.org>
.github/workflows/parity-gate.yml (vendored, Normal file, 93 lines added)
@@ -0,0 +1,93 @@
name: Parity gate

on:
  pull_request:
    types: [opened, reopened, synchronize, ready_for_review]
    paths:
      - "extensions/qa-lab/**"
      - "extensions/qa-channel/**"
      - "extensions/openai/**"
      - "qa/scenarios/**"
      - "src/agents/**"
      - "src/context-engine/**"
      - "src/gateway/**"
      - "src/media/**"
      - ".github/workflows/parity-gate.yml"

permissions:
  contents: read

concurrency:
  group: parity-gate-${{ github.event.pull_request.number || github.sha }}
  cancel-in-progress: true

jobs:
  parity-gate:
    name: Run the GPT-5.4 / Opus 4.6 parity gate against the qa-lab mock
    if: ${{ github.event.pull_request.draft != true }}
    runs-on: blacksmith-8vcpu-ubuntu-2404
    timeout-minutes: 20
    env:
      # Fence the gate off from any real provider credentials. The qa-lab
      # mock server + auth staging (PR N) should be enough to produce a
      # meaningful verdict without touching a real API. If any of these
      # leak into the job env, fail hard instead of silently running
      # against a live provider and burning real budget.
      OPENAI_API_KEY: ""
      ANTHROPIC_API_KEY: ""
      OPENCLAW_LIVE_OPENAI_KEY: ""
      OPENCLAW_LIVE_ANTHROPIC_KEY: ""
      OPENCLAW_LIVE_GEMINI_KEY: ""
      OPENCLAW_LIVE_SETUP_TOKEN_VALUE: ""
    steps:
      - name: Checkout PR
        uses: actions/checkout@v4

      - name: Install pnpm
        uses: pnpm/action-setup@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: "22.14.0"
          cache: "pnpm"

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: Run GPT-5.4 lane
        run: |
          pnpm openclaw qa suite \
            --provider-mode mock-openai \
            --parity-pack agentic \
            --model openai/gpt-5.4 \
            --alt-model openai/gpt-5.4-alt \
            --output-dir .artifacts/qa-e2e/gpt54

      - name: Run Opus 4.6 lane
        run: |
          pnpm openclaw qa suite \
            --provider-mode mock-openai \
            --parity-pack agentic \
            --model anthropic/claude-opus-4-6 \
            --alt-model anthropic/claude-sonnet-4-6 \
            --output-dir .artifacts/qa-e2e/opus46

      - name: Generate parity report
        run: |
          pnpm openclaw qa parity-report \
            --repo-root . \
            --candidate-summary .artifacts/qa-e2e/gpt54/qa-suite-summary.json \
            --baseline-summary .artifacts/qa-e2e/opus46/qa-suite-summary.json \
            --candidate-label openai/gpt-5.4 \
            --baseline-label anthropic/claude-opus-4-6 \
            --output-dir .artifacts/qa-e2e/parity

      - name: Upload parity artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: parity-gate-${{ github.event.pull_request.number || github.sha }}
          path: .artifacts/qa-e2e/
          retention-days: 14
          if-no-files-found: warn
@@ -2,16 +2,42 @@ import { describe, expect, it } from "vitest";
import {
  buildQaAgenticParityComparison,
  computeQaAgenticParityMetrics,
  QaParityLabelMismatchError,
  renderQaAgenticParityMarkdownReport,
  type QaParityReportScenario,
  type QaParitySuiteSummary,
} from "./agentic-parity-report.js";

const FULL_PARITY_PASS_SCENARIOS: QaParityReportScenario[] = [
  { name: "Approval turn tool followthrough", status: "pass" as const },
  { name: "Compaction retry after mutating tool", status: "pass" as const },
  { name: "Model switch with tool continuity", status: "pass" as const },
  { name: "Source and docs discovery report", status: "pass" as const },
  { name: "Image understanding from attachment", status: "pass" as const },
  { name: "Subagent handoff", status: "pass" as const },
  { name: "Subagent fanout synthesis", status: "pass" as const },
  { name: "Memory recall after context switch", status: "pass" as const },
  { name: "Thread memory isolation", status: "pass" as const },
  { name: "Config restart capability flip", status: "pass" as const },
  { name: "Instruction followthrough repo contract", status: "pass" as const },
];

function withScenarioOverride(name: string, override: Partial<QaParityReportScenario>) {
  return FULL_PARITY_PASS_SCENARIOS.map((scenario) =>
    scenario.name === name ? { ...scenario, ...override } : scenario,
  );
}

describe("qa agentic parity report", () => {
  it("computes first-wave parity metrics from suite summaries", () => {
    const summary: QaParitySuiteSummary = {
      scenarios: [
        { name: "Scenario A", status: "pass" },
        { name: "Scenario B", status: "fail", details: "incomplete turn detected" },
        { name: "Approval turn tool followthrough", status: "pass" },
        {
          name: "Compaction retry after mutating tool",
          status: "fail",
          details: "incomplete turn detected",
        },
      ],
    };
@@ -28,6 +54,23 @@ describe("qa agentic parity report", () => {
    });
  });

  it("keeps non-tool scenarios out of the valid-tool-call metric", () => {
    const summary: QaParitySuiteSummary = {
      scenarios: [
        { name: "Approval turn tool followthrough", status: "pass" },
        { name: "Memory recall after context switch", status: "pass" },
        { name: "Image understanding from attachment", status: "pass" },
      ],
    };

    expect(computeQaAgenticParityMetrics(summary)).toMatchObject({
      totalScenarios: 3,
      passedScenarios: 3,
      validToolCallCount: 1,
      validToolCallRate: 1,
    });
  });

  it("fails the parity gate when the candidate regresses against baseline", () => {
    const comparison = buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
@@ -207,33 +250,70 @@ describe("qa agentic parity report", () => {
    );
  });

  it("fails the parity gate when a required parity scenario fails on both sides", () => {
    // Regression for the loop-7 Codex-connector P1 finding: without this
    // check, a required parity scenario that fails on both candidate and
    // baseline still produces pass=true because the downstream metric
    // comparisons are purely relative (candidate vs baseline). Cover the
    // whole parity pack as pass on both sides except the one scenario we
    // deliberately fail on both sides, so the assertion can pin the
    // isolated gate failure under test.
    const scenariosWithBothFail = withScenarioOverride("Approval turn tool followthrough", {
      status: "fail",
    });
    const comparison = buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: { scenarios: scenariosWithBothFail },
      baselineSummary: { scenarios: scenariosWithBothFail },
      comparedAt: "2026-04-11T00:00:00.000Z",
    });

    expect(comparison.pass).toBe(false);
    expect(comparison.failures).toContain(
      "Required parity scenario Approval turn tool followthrough failed: openai/gpt-5.4=fail, anthropic/claude-opus-4-6=fail.",
    );
    // Metric comparisons are relative, so a same-on-both-sides failure
    // must not appear as a relative metric failure. The required-scenario
    // failure line is the only thing keeping the gate honest here.
    expect(comparison.failures.some((failure) => failure.includes("completion rate"))).toBe(false);
  });

  it("fails the parity gate when a required parity scenario fails on the candidate only", () => {
    // A candidate regression below a passing baseline is already caught
    // by the relative completion-rate comparison, but surface it as a
    // named required-scenario failure too so operators see a concrete
    // scenario name alongside the rate differential.
    const candidateWithOneFail = withScenarioOverride("Approval turn tool followthrough", {
      status: "fail",
    });
    const comparison = buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: { scenarios: candidateWithOneFail },
      baselineSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
      comparedAt: "2026-04-11T00:00:00.000Z",
    });

    expect(comparison.pass).toBe(false);
    expect(comparison.failures).toContain(
      "Required parity scenario Approval turn tool followthrough failed: openai/gpt-5.4=fail, anthropic/claude-opus-4-6=pass.",
    );
  });

  it("fails the parity gate when the baseline contains suspicious pass results", () => {
-    // Cover the full first-wave pack on both sides so the suspicious-pass assertion
+    // Cover the full second-wave pack on both sides so the suspicious-pass assertion
    // below is the isolated gate failure under test (no coverage-gap noise).
    const comparison = buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: {
-        scenarios: [
-          { name: "Approval turn tool followthrough", status: "pass" },
-          { name: "Compaction retry after mutating tool", status: "pass" },
-          { name: "Model switch with tool continuity", status: "pass" },
-          { name: "Source and docs discovery report", status: "pass" },
-          { name: "Image understanding from attachment", status: "pass" },
-        ],
+        scenarios: FULL_PARITY_PASS_SCENARIOS,
      },
      baselineSummary: {
-        scenarios: [
-          {
-            name: "Approval turn tool followthrough",
-            status: "pass",
-            details: "timed out before it continued",
-          },
-          { name: "Compaction retry after mutating tool", status: "pass" },
-          { name: "Model switch with tool continuity", status: "pass" },
-          { name: "Source and docs discovery report", status: "pass" },
-          { name: "Image understanding from attachment", status: "pass" },
-        ],
+        scenarios: withScenarioOverride("Approval turn tool followthrough", {
+          details: "timed out before it continued",
+        }),
      },
      comparedAt: "2026-04-11T00:00:00.000Z",
    });
@@ -303,36 +383,333 @@ Follow-up:
|
||||
expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(1);
|
||||
});
|
||||
|
||||
it("renders a readable markdown parity report", () => {
|
||||
it("does not flag positive-tone prose as fake success (positive-tone detection removed)", () => {
  // Positive-tone detection was removed because for passing runs the
  // `details` field is the model's prose, which never contains tool-call
  // evidence. Criterion 2 is enforced by per-scenario tool-call assertions.
  const summary: QaParitySuiteSummary = {
    scenarios: [
      {
        name: "Subagent handoff",
        status: "pass",
        details: "Successfully completed the delegation. The subagent returned its result.",
      },
    ],
  };

  expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
});

it("does not flag bare 'Done.' prose as fake success", () => {
  const summary: QaParitySuiteSummary = {
    scenarios: [
      {
        name: "Approval turn tool followthrough",
        status: "pass",
        details: "Done.",
      },
    ],
  };

  expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
});

it("does not flag structured status lines that end in `done`", () => {
  const summary: QaParitySuiteSummary = {
    scenarios: [
      {
        name: "Compaction retry after mutating tool",
        status: "pass",
        details: `Confirmed, replay unsafe after write.
compactionCount=0
status=done`,
      },
    ],
  };

  expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
});

it("does not flag positive-tone passes when the scenario shows real tool-call evidence", () => {
  // A legitimate tool-mediated pass that happens to include
  // "successfully" in its prose must not be flagged. The
  // `plannedToolName` evidence (or any of the other tool-call
  // evidence patterns) exempts the scenario from positive-tone
  // detection. Without this exemption, real tool-backed passes with
  // self-congratulatory prose would count as fake successes and break
  // the gate.
  const summary: QaParitySuiteSummary = {
    scenarios: [
      {
        name: "Source and docs discovery report",
        status: "pass",
        details:
          "Successfully completed the report. plannedToolName=read recorded via /debug/requests.",
      },
    ],
  };

  expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(0);
});

it("only flags failure-tone passes, not positive-tone", () => {
  const summary: QaParitySuiteSummary = {
    scenarios: [
      {
        name: "Approval turn tool followthrough",
        status: "pass",
        details: "Task executed successfully without errors.",
      },
      {
        name: "Subagent handoff",
        status: "pass",
        details: "Tool call completed, but an error occurred mid-turn.",
      },
    ],
  };

  // Only the failure-tone scenario ("error occurred") counts.
  // The positive-tone one ("successfully") is not flagged.
  expect(computeQaAgenticParityMetrics(summary).fakeSuccessCount).toBe(1);
});

it("throws QaParityLabelMismatchError when the candidate run.primaryProvider does not match the label", () => {
  // Regression for the gate footgun: if an operator swaps the
  // --candidate-summary and --baseline-summary paths, the gate would
  // silently produce a reversed verdict. PR L #64789 ships the `run`
  // block on every summary so the parity report can verify it against
  // the caller-supplied label; this test pins the precondition check.
  const parityPassScenarios = [
    { name: "Approval turn tool followthrough", status: "pass" as const },
    { name: "Compaction retry after mutating tool", status: "pass" as const },
    { name: "Model switch with tool continuity", status: "pass" as const },
    { name: "Source and docs discovery report", status: "pass" as const },
    { name: "Image understanding from attachment", status: "pass" as const },
  ];

  expect(() =>
    buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: {
        scenarios: parityPassScenarios,
        run: { primaryProvider: "anthropic", primaryModel: "claude-opus-4-6" },
      },
      baselineSummary: {
        scenarios: parityPassScenarios,
        run: { primaryProvider: "anthropic", primaryModel: "claude-opus-4-6" },
      },
      comparedAt: "2026-04-11T00:00:00.000Z",
    }),
  ).toThrow(QaParityLabelMismatchError);
});

it("throws QaParityLabelMismatchError when the baseline run.primaryProvider does not match the label", () => {
  const parityPassScenarios = [
    { name: "Approval turn tool followthrough", status: "pass" as const },
  ];

  expect(() =>
    buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: {
        scenarios: parityPassScenarios,
        run: { primaryProvider: "openai" },
      },
      baselineSummary: {
        scenarios: parityPassScenarios,
        run: { primaryProvider: "openai", primaryModel: "gpt-5.4" },
      },
      comparedAt: "2026-04-11T00:00:00.000Z",
    }),
  ).toThrow(
    /baseline summary run\.primaryProvider=openai and run\.primaryModel=gpt-5\.4 do not match --baseline-label/,
  );
});

it("accepts matching run.primaryProvider labels without throwing", () => {
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "openai/gpt-5.4",
    baselineLabel: "anthropic/claude-opus-4-6",
    candidateSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "openai",
        primaryModel: "openai/gpt-5.4",
        primaryModelName: "gpt-5.4",
      },
    },
    baselineSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "anthropic",
        primaryModel: "anthropic/claude-opus-4-6",
        primaryModelName: "claude-opus-4-6",
      },
    },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });
  expect(comparison.pass).toBe(true);
});

it("skips run.primaryProvider verification when the summary is missing a run block (legacy summaries)", () => {
  // Pre-PR-L summaries don't carry a `run` block. The gate must still
  // work against those, trusting the caller-supplied label.
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "openai/gpt-5.4",
    baselineLabel: "anthropic/claude-opus-4-6",
    candidateSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
    baselineSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });
  expect(comparison.pass).toBe(true);
});

it("skips provider verification for arbitrary display labels when run metadata is present", () => {
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "GPT-5.4 candidate",
    baselineLabel: "Opus 4.6 baseline",
    candidateSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "openai",
        primaryModel: "openai/gpt-5.4",
        primaryModelName: "gpt-5.4",
      },
    },
    baselineSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "anthropic",
        primaryModel: "anthropic/claude-opus-4-6",
        primaryModelName: "claude-opus-4-6",
      },
    },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });

  expect(comparison.pass).toBe(true);
});

it("skips provider verification for mixed-case or decorated display labels", () => {
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "Candidate: GPT-5.4",
    baselineLabel: "Opus 4.6 / baseline",
    candidateSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "openai",
        primaryModel: "openai/gpt-5.4",
        primaryModelName: "gpt-5.4",
      },
    },
    baselineSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "anthropic",
        primaryModel: "anthropic/claude-opus-4-6",
        primaryModelName: "claude-opus-4-6",
      },
    },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });

  expect(comparison.pass).toBe(true);
});

it("throws when a structured label mismatches the recorded model even if the provider matches", () => {
  expect(() =>
    buildQaAgenticParityComparison({
      candidateLabel: "openai/gpt-5.4",
      baselineLabel: "anthropic/claude-opus-4-6",
      candidateSummary: {
        scenarios: FULL_PARITY_PASS_SCENARIOS,
        run: {
          primaryProvider: "openai",
          primaryModel: "openai/gpt-5.4-alt",
          primaryModelName: "gpt-5.4-alt",
        },
      },
      baselineSummary: {
        scenarios: FULL_PARITY_PASS_SCENARIOS,
        run: {
          primaryProvider: "anthropic",
          primaryModel: "anthropic/claude-opus-4-6",
          primaryModelName: "claude-opus-4-6",
        },
      },
      comparedAt: "2026-04-11T00:00:00.000Z",
    }),
  ).toThrow(
    /candidate summary run\.primaryProvider=openai and run\.primaryModel=openai\/gpt-5\.4-alt do not match --candidate-label=openai\/gpt-5\.4/,
  );
});

it("accepts colon-delimited structured labels when provider and model both match", () => {
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "openai:gpt-5.4",
    baselineLabel: "anthropic:claude-opus-4-6",
    candidateSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "openai",
        primaryModel: "openai/gpt-5.4",
        primaryModelName: "gpt-5.4",
      },
    },
    baselineSummary: {
      scenarios: FULL_PARITY_PASS_SCENARIOS,
      run: {
        primaryProvider: "anthropic",
        primaryModel: "anthropic/claude-opus-4-6",
        primaryModelName: "claude-opus-4-6",
      },
    },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });

  expect(comparison.pass).toBe(true);
});

it("renders a readable markdown parity report", () => {
  // Cover the full parity pack on both sides so the pass verdict is not
  // disrupted by required-scenario coverage failures added by the
  // second-wave expansion.
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "openai/gpt-5.4",
    baselineLabel: "anthropic/claude-opus-4-6",
    candidateSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
    baselineSummary: { scenarios: FULL_PARITY_PASS_SCENARIOS },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });

  const report = renderQaAgenticParityMarkdownReport(comparison);

  expect(report).toContain(
    "# OpenClaw Agentic Parity Report — openai/gpt-5.4 vs anthropic/claude-opus-4-6",
  );
  expect(report).toContain("| Completion rate | 100.0% | 100.0% |");
  expect(report).toContain("### Approval turn tool followthrough");
  expect(report).toContain("- Verdict: pass");
});

it("parametrizes the markdown header from the comparison labels", () => {
  // Regression for the loop-7 Copilot finding: callers that configure
  // non-gpt-5.4 / non-opus labels (for example an internal candidate vs
  // another candidate) must see the labels in the rendered H1 instead of
  // the hardcoded "GPT-5.4 / Opus 4.6" title that would otherwise confuse
  // readers of saved reports.
  const comparison = buildQaAgenticParityComparison({
    candidateLabel: "openai/gpt-5.4-alt",
    baselineLabel: "openai/gpt-5.4",
    candidateSummary: { scenarios: [] },
    baselineSummary: { scenarios: [] },
    comparedAt: "2026-04-11T00:00:00.000Z",
  });
  const report = renderQaAgenticParityMarkdownReport(comparison);
  expect(report).toContain(
    "# OpenClaw Agentic Parity Report — openai/gpt-5.4-alt vs openai/gpt-5.4",
  );
});
});

@@ -1,4 +1,7 @@
import {
  QA_AGENTIC_PARITY_SCENARIO_TITLES,
  QA_AGENTIC_PARITY_TOOL_BACKED_SCENARIO_TITLES,
} from "./agentic-parity.js";

export type QaParityReportStep = {
  name: string;
@@ -13,6 +16,21 @@ export type QaParityReportScenario = {
  steps?: QaParityReportStep[];
};

/**
 * Optional self-describing run metadata written by PR L (#64789). Before
 * that PR merges, older summaries only have `scenarios` + `counts`; the
 * parity report treats a missing `run` block as "unknown provenance" and
 * skips the label-match verification for backwards compatibility with
 * legacy summaries that predate the run metadata block.
 */
export type QaParityRunBlock = {
  primaryProvider?: string;
  primaryModel?: string;
  primaryModelName?: string;
  providerMode?: string;
  scenarioIds?: readonly string[] | null;
};

export type QaParitySuiteSummary = {
  scenarios: QaParityReportScenario[];
  counts?: {
@@ -20,6 +38,8 @@ export type QaParitySuiteSummary = {
    passed?: number;
    failed?: number;
  };
  /** Self-describing run metadata — see PR L #64789 for the writer side. */
  run?: QaParityRunBlock;
};

export type QaAgenticParityMetrics = {
@@ -64,7 +84,11 @@ const UNINTENDED_STOP_PATTERNS = [
  /did not continue/i,
] as const;

// Failure-tone patterns: a passing scenario whose details text matches any
// of these is treated as a "fake success" — the scenario is marked pass but
// the supporting text reveals something went wrong. Adding new patterns here
// widens the net for bad prose that correlates with runtime failure modes.
const SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS = [
  /incomplete turn/i,
  /\btimed out\b/i,
  /\btimeout\b/i,
@@ -76,6 +100,13 @@ const SUSPICIOUS_PASS_PATTERNS = [
  /an error was/i,
] as const;

// Positive-tone patterns (e.g. "Successfully completed", "Done.") are NOT
// checked in fakeSuccessCount. For passing runs, `details` is the model's
// outbound prose, which never contains tool-call evidence strings, so a
// tool-call-evidence exemption would false-positive on every legitimate
// pass. Criterion 2 ("no fake progress") is enforced by per-scenario
// `/debug/requests` tool-call assertions in the YAML flows (PR J) instead.

function normalizeScenarioStatus(status: string | undefined): "pass" | "fail" | "skip" {
  return status === "pass" || status === "fail" || status === "skip" ? status : "fail";
}
@@ -103,6 +134,9 @@ export function computeQaAgenticParityMetrics(
    ...scenario,
    status: normalizeScenarioStatus(scenario.status),
  }));
  const toolBackedTitleSet: ReadonlySet<string> = new Set(
    QA_AGENTIC_PARITY_TOOL_BACKED_SCENARIO_TITLES,
  );
  const totalScenarios = summary.counts?.total ?? scenarios.length;
  const passedScenarios =
    summary.counts?.passed ?? scenarios.filter((scenario) => scenario.status === "pass").length;
@@ -112,16 +146,40 @@ export function computeQaAgenticParityMetrics(
    (scenario) =>
      scenario.status !== "pass" && scenarioHasPattern(scenario, UNINTENDED_STOP_PATTERNS),
  ).length;
  const fakeSuccessCount = scenarios.filter((scenario) => {
    if (scenario.status !== "pass") {
      return false;
    }
    // Failure-tone patterns catch obviously-broken passes regardless of
    // whether the scenario shows tool-call evidence — "timed out" under a
    // pass is always fake.
    if (scenarioHasPattern(scenario, SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS)) {
      return true;
    }
    // Positive-tone patterns (like "Successfully completed") are NOT checked
    // here because for passing runs the `details` field is the model's
    // outbound prose, which never contains tool-call evidence strings.
    // The `scenarioLacksToolCallEvidence` check would return true for ALL
    // passes and false-positive on legitimate completions. Criterion 2
    // ("no fake tool completion") is instead enforced by the per-scenario
    // `/debug/requests` tool-call assertions from the scenario YAML flows.
    return false;
  }).length;

  // Count only the scenarios that are supposed to exercise a real tool,
  // subagent, or capability invocation. Memory recall and image-only
  // understanding lanes stay in the parity pack, but they should not inflate
  // the tool-call metric just by passing.
  const toolBackedScenarioCount = scenarios.filter((scenario) =>
    toolBackedTitleSet.has(scenario.name),
  ).length;
  const validToolCallCount = scenarios.filter(
    (scenario) => toolBackedTitleSet.has(scenario.name) && scenario.status === "pass",
  ).length;

  const rate = (value: number) => (totalScenarios > 0 ? value / totalScenarios : 0);
  const toolRate = (value: number) =>
    toolBackedScenarioCount > 0 ? value / toolBackedScenarioCount : 0;
  return {
    totalScenarios,
    passedScenarios,
@@ -130,7 +188,7 @@ export function computeQaAgenticParityMetrics(
    unintendedStopCount,
    unintendedStopRate: rate(unintendedStopCount),
    validToolCallCount,
    validToolCallRate: toolRate(validToolCallCount),
    fakeSuccessCount,
  };
}
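The failure-tone rule above can be sketched standalone. This is an illustrative reduction, not the module's exports: the pattern list is a two-entry subset, and `isFakeSuccess` stands in for the `status === "pass"` guard plus `scenarioHasPattern` over the failure-tone list.

```typescript
// Only a PASS whose details carry failure-sounding prose is a fake
// success; positive tone alone never flags, and non-passes are counted
// elsewhere (as unintended stops), never here.
const FAILURE_TONE = [/\btimed out\b/i, /an error occurred/i];

function isFakeSuccess(status: string, details: string): boolean {
  if (status !== "pass") return false;
  return FAILURE_TONE.some((pattern) => pattern.test(details));
}

console.log(isFakeSuccess("pass", "Tool call completed, but an error occurred mid-turn.")); // true
console.log(isFakeSuccess("pass", "Successfully completed the delegation.")); // false
console.log(isFakeSuccess("fail", "timed out before it continued")); // false
```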
@@ -149,14 +207,116 @@ function scopeSummaryToParityPack(
  summary: QaParitySuiteSummary,
  parityTitleSet: ReadonlySet<string>,
): QaParitySuiteSummary {
  // The parity verdict must only consider the declared parity scenarios
  // (the full first-wave + second-wave pack from QA_AGENTIC_PARITY_SCENARIOS).
  // Drop `counts` so the metric helper recomputes totals from the filtered
  // scenario list instead of inheriting the caller's full-suite counters.
  return {
    scenarios: summary.scenarios.filter((scenario) => parityTitleSet.has(scenario.name)),
    ...(summary.run ? { run: summary.run } : {}),
  };
}
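A standalone sketch of the scoping rule above, with illustrative scenario names: filter to the parity title set and let totals be rebuilt from the filtered list rather than trusting the caller's full-suite counts.

```typescript
// Non-parity rows are dropped; the stale `counts` block is simply not
// carried over, so downstream metrics recompute from what's left.
const parityTitleSet = new Set(["Subagent handoff", "Memory recall after context switch"]);
const fullSuite = [
  { name: "Subagent handoff", status: "pass" },
  { name: "Unrelated smoke test", status: "fail" },
];

const scoped = fullSuite.filter((scenario) => parityTitleSet.has(scenario.name));
console.log(scoped.map((scenario) => scenario.name)); // ["Subagent handoff"]
```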

type StructuredQaParityLabel = {
  provider: string;
  model: string;
};

/**
 * Only treat caller labels as provenance-checked identifiers when they are
 * exact lower-case provider/model refs. Human-facing display labels like
 * "GPT-5.4 candidate" or "Candidate: GPT-5.4" should render in the report
 * without being misread as structured provider ids.
 */
function parseStructuredLabelRef(label: string): StructuredQaParityLabel | null {
  const trimmed = label.trim();
  if (trimmed.length === 0) {
    return null;
  }
  if (trimmed !== trimmed.toLowerCase()) {
    return null;
  }
  const separatorMatch = /^([a-z0-9][a-z0-9-]*)[/:]([a-z0-9][a-z0-9._-]*)$/.exec(trimmed);
  if (!separatorMatch) {
    return null;
  }
  return {
    provider: separatorMatch[1] ?? "",
    model: separatorMatch[2] ?? "",
  };
}
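The parsing contract above, shown standalone with the same regex (the `parseLabel` wrapper name is for this sketch only): exact lower-case `provider/model` or `provider:model` refs parse, and display labels fall through to `null`.

```typescript
const STRUCTURED_LABEL = /^([a-z0-9][a-z0-9-]*)[/:]([a-z0-9][a-z0-9._-]*)$/;

function parseLabel(label: string): { provider: string; model: string } | null {
  const trimmed = label.trim();
  // Any uppercase character marks a human-facing display label.
  if (trimmed.length === 0 || trimmed !== trimmed.toLowerCase()) return null;
  const match = STRUCTURED_LABEL.exec(trimmed);
  return match ? { provider: match[1] ?? "", model: match[2] ?? "" } : null;
}

console.log(parseLabel("openai/gpt-5.4")); // { provider: "openai", model: "gpt-5.4" }
console.log(parseLabel("anthropic:claude-opus-4-6")); // { provider: "anthropic", model: "claude-opus-4-6" }
console.log(parseLabel("Candidate: GPT-5.4")); // null (display label)
```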

/**
 * Verify the `run.primaryProvider` + `run.primaryModel` fields on a summary
 * match the caller-supplied label when that label is a structured
 * `provider/model` or `provider:model` ref. PR L #64789 ships the `run`
 * block; before it lands, older summaries don't have the field and this check
 * is a no-op.
 *
 * Throws `QaParityLabelMismatchError` when the summary reports a different
 * provider/model than the caller claimed — this catches the "swapped
 * candidate and baseline summary paths" footgun the earlier adversarial
 * review flagged. Returns silently when the fields are absent (legacy
 * summaries) or when the fields match.
 */
function verifySummaryLabelMatch(params: {
  summary: QaParitySuiteSummary;
  label: string;
  role: "candidate" | "baseline";
}): void {
  const runProvider = params.summary.run?.primaryProvider?.trim();
  const runModel = params.summary.run?.primaryModel?.trim();
  const runModelName = params.summary.run?.primaryModelName?.trim();
  if (!runProvider || !runModel) {
    return;
  }
  const labelRef = parseStructuredLabelRef(params.label);
  if (!labelRef) {
    return;
  }
  const normalizedRunModel = runModel.toLowerCase();
  const normalizedRunModelName = runModelName?.toLowerCase();
  const normalizedLabelModel = labelRef.model;
  if (
    runProvider.toLowerCase() === labelRef.provider &&
    (normalizedRunModel === normalizedLabelModel ||
      normalizedRunModelName === normalizedLabelModel ||
      normalizedRunModel === `${labelRef.provider}/${normalizedLabelModel}`)
  ) {
    return;
  }
  throw new QaParityLabelMismatchError({
    role: params.role,
    label: params.label,
    runProvider,
    runModel,
  });
}
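The model comparison above accepts three spellings of `run.primaryModel` for a structured label. A standalone sketch for the label "openai/gpt-5.4" (the sample run values below, including the "gpt-5.4-2026" spelling, are assumptions for illustration):

```typescript
// Accepted forms: the bare model id, the provider-qualified id, or a
// matching primaryModelName alongside a differently-spelled primaryModel.
const provider = "openai";
const labelModel = "gpt-5.4";

function labelMatchesRun(runModel: string, runModelName?: string): boolean {
  const model = runModel.toLowerCase();
  return (
    model === labelModel ||
    runModelName?.toLowerCase() === labelModel ||
    model === `${provider}/${labelModel}`
  );
}

console.log(labelMatchesRun("gpt-5.4")); // true
console.log(labelMatchesRun("openai/gpt-5.4")); // true
console.log(labelMatchesRun("gpt-5.4-2026", "gpt-5.4")); // true (name matches)
console.log(labelMatchesRun("openai/gpt-5.4-alt", "gpt-5.4-alt")); // false
```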

export class QaParityLabelMismatchError extends Error {
  readonly role: "candidate" | "baseline";
  readonly label: string;
  readonly runProvider: string;
  readonly runModel: string;

  constructor(params: {
    role: "candidate" | "baseline";
    label: string;
    runProvider: string;
    runModel: string;
  }) {
    super(
      `${params.role} summary run.primaryProvider=${params.runProvider} and run.primaryModel=${params.runModel} do not match --${params.role}-label=${params.label}. ` +
        `Check that the --candidate-summary / --baseline-summary paths weren't swapped.`,
    );
    this.name = "QaParityLabelMismatchError";
    this.role = params.role;
    this.label = params.label;
    this.runProvider = params.runProvider;
    this.runModel = params.runModel;
  }
}

export function buildQaAgenticParityComparison(params: {
  candidateLabel: string;
  baselineLabel: string;
@@ -164,6 +324,22 @@ export function buildQaAgenticParityComparison(params: {
  baselineSummary: QaParitySuiteSummary;
  comparedAt?: string;
}): QaAgenticParityComparison {
  // Precondition: verify the `run.primaryProvider` field on each summary
  // matches the caller-supplied label (when the `run` block is present).
  // Throws `QaParityLabelMismatchError` on mismatch so the release gate
  // fails loudly instead of silently producing a reversed verdict when an
  // operator swaps the --candidate-summary and --baseline-summary paths.
  // Legacy summaries without a `run` block are accepted as-is.
  verifySummaryLabelMatch({
    summary: params.candidateSummary,
    label: params.candidateLabel,
    role: "candidate",
  });
  verifySummaryLabelMatch({
    summary: params.baselineSummary,
    label: params.baselineLabel,
    role: "baseline",
  });
  const parityTitleSet: ReadonlySet<string> = new Set<string>(QA_AGENTIC_PARITY_SCENARIO_TITLES);
  // Rates and fake-success counts are computed from the parity-scoped summaries only,
  // so extra non-parity scenarios in the input (for example when a caller feeds a full
@@ -203,7 +379,7 @@ export function buildQaAgenticParityComparison(params: {
  });

  const failures: string[] = [];
  const requiredScenarioStatuses = QA_AGENTIC_PARITY_SCENARIO_TITLES.map((name) => {
    const candidate = candidateByName.get(name);
    const baseline = baselineByName.get(name);
    return {
@@ -211,7 +387,8 @@ export function buildQaAgenticParityComparison(params: {
      candidateStatus: requiredCoverageStatus(candidate),
      baselineStatus: requiredCoverageStatus(baseline),
    };
  });
  const requiredScenarioCoverage = requiredScenarioStatuses.filter(
    (scenario) =>
      scenario.candidateStatus === "missing" ||
      scenario.baselineStatus === "missing" ||
@@ -223,6 +400,26 @@ export function buildQaAgenticParityComparison(params: {
      `Missing required parity scenario coverage for ${scenario.name}: ${params.candidateLabel}=${scenario.candidateStatus}, ${params.baselineLabel}=${scenario.baselineStatus}.`,
    );
  }
  // Required parity scenarios that ran on both sides but FAILED also fail
  // the gate. Without this check, a run where both models fail the same
  // required scenarios still produced pass=true, because the downstream
  // metric comparisons are purely relative (candidate vs baseline) and
  // the suspicious-pass fake-success check only catches passes that carry
  // failure-sounding details. Excluding missing/skip here keeps operator
  // output from double-counting the same scenario with two lines.
  const requiredScenarioFailures = requiredScenarioStatuses.filter(
    (scenario) =>
      scenario.candidateStatus !== "missing" &&
      scenario.baselineStatus !== "missing" &&
      scenario.candidateStatus !== "skip" &&
      scenario.baselineStatus !== "skip" &&
      (scenario.candidateStatus === "fail" || scenario.baselineStatus === "fail"),
  );
  for (const scenario of requiredScenarioFailures) {
    failures.push(
      `Required parity scenario ${scenario.name} failed: ${params.candidateLabel}=${scenario.candidateStatus}, ${params.baselineLabel}=${scenario.baselineStatus}.`,
    );
  }
  // Required parity scenarios are already reported via `requiredScenarioCoverage`
  // above; excluding them here keeps the operator-facing failure list from
  // double-counting the same missing scenario (one "Missing required parity scenario
@@ -281,8 +478,13 @@ export function buildQaAgenticParityComparison(params: {
}

export function renderQaAgenticParityMarkdownReport(comparison: QaAgenticParityComparison): string {
  // Title is parametrized from the candidate / baseline labels so reports
  // for any candidate/baseline pair (not only gpt-5.4 vs opus 4.6) render
  // with an accurate header. The default CLI labels are still
  // openai/gpt-5.4 vs anthropic/claude-opus-4-6, but the helper works for
  // any parity comparison a caller configures.
  const lines = [
    `# OpenClaw Agentic Parity Report — ${comparison.candidateLabel} vs ${comparison.baselineLabel}`,
    "",
    `- Compared at: ${comparison.comparedAt}`,
    `- Candidate: ${comparison.candidateLabel}`,

@@ -4,22 +4,57 @@ export const QA_AGENTIC_PARITY_SCENARIOS = [
  {
    id: "approval-turn-tool-followthrough",
    title: "Approval turn tool followthrough",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "model-switch-tool-continuity",
    title: "Model switch with tool continuity",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "source-docs-discovery-report",
    title: "Source and docs discovery report",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "image-understanding-attachment",
    title: "Image understanding from attachment",
    countsTowardValidToolCallRate: false,
  },
  {
    id: "compaction-retry-mutating-tool",
    title: "Compaction retry after mutating tool",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "subagent-handoff",
    title: "Subagent handoff",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "subagent-fanout-synthesis",
    title: "Subagent fanout synthesis",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "memory-recall",
    title: "Memory recall after context switch",
    countsTowardValidToolCallRate: false,
  },
  {
    id: "thread-memory-isolation",
    title: "Thread memory isolation",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "config-restart-capability-flip",
    title: "Config restart capability flip",
    countsTowardValidToolCallRate: true,
  },
  {
    id: "instruction-followthrough-repo-contract",
    title: "Instruction followthrough repo contract",
    countsTowardValidToolCallRate: true,
  },
] as const;

@@ -27,6 +62,9 @@ export const QA_AGENTIC_PARITY_SCENARIO_IDS = QA_AGENTIC_PARITY_SCENARIOS.map(({
export const QA_AGENTIC_PARITY_SCENARIO_TITLES = QA_AGENTIC_PARITY_SCENARIOS.map(
  ({ title }) => title,
);
export const QA_AGENTIC_PARITY_TOOL_BACKED_SCENARIO_TITLES = QA_AGENTIC_PARITY_SCENARIOS.filter(
  ({ countsTowardValidToolCallRate }) => countsTowardValidToolCallRate,
).map(({ title }) => title);
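The derived-list pattern above, sketched with a two-row table of the same shape as QA_AGENTIC_PARITY_SCENARIOS (rows abbreviated for the example):

```typescript
// A scenario opts into the tool-call metric via its
// countsTowardValidToolCallRate flag; the title list is derived, never
// hand-maintained, so the two cannot drift apart.
const SCENARIOS = [
  { id: "subagent-handoff", title: "Subagent handoff", countsTowardValidToolCallRate: true },
  { id: "memory-recall", title: "Memory recall after context switch", countsTowardValidToolCallRate: false },
] as const;

const toolBackedTitles = SCENARIOS.filter(
  ({ countsTowardValidToolCallRate }) => countsTowardValidToolCallRate,
).map(({ title }) => title);

console.log(toolBackedTitles); // ["Subagent handoff"]
```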

export function resolveQaParityPackScenarioIds(params: {
  parityPack?: string;

@@ -338,6 +338,12 @@ describe("qa cli runtime", () => {
        "source-docs-discovery-report",
        "image-understanding-attachment",
        "compaction-retry-mutating-tool",
        "subagent-handoff",
        "subagent-fanout-synthesis",
        "memory-recall",
        "thread-memory-isolation",
        "config-restart-capability-flip",
        "instruction-followthrough-repo-contract",
      ],
    }),
  );
@@ -566,6 +572,39 @@ describe("qa cli runtime", () => {
    );
  });

it("passes provider-qualified mock parity suite selection through to the host runner", async () => {
  await runQaSuiteCommand({
    repoRoot: "/tmp/openclaw-repo",
    providerMode: "mock-openai",
    parityPack: "agentic",
    primaryModel: "openai/gpt-5.4",
    alternateModel: "anthropic/claude-opus-4-6",
  });

  expect(runQaSuiteFromRuntime).toHaveBeenCalledWith({
    repoRoot: path.resolve("/tmp/openclaw-repo"),
    outputDir: undefined,
    transportId: "qa-channel",
    providerMode: "mock-openai",
    primaryModel: "openai/gpt-5.4",
    alternateModel: "anthropic/claude-opus-4-6",
    fastMode: undefined,
    scenarioIds: [
      "approval-turn-tool-followthrough",
      "model-switch-tool-continuity",
      "source-docs-discovery-report",
      "image-understanding-attachment",
      "compaction-retry-mutating-tool",
      "subagent-handoff",
      "subagent-fanout-synthesis",
      "memory-recall",
      "thread-memory-isolation",
      "config-restart-capability-flip",
      "instruction-followthrough-repo-contract",
    ],
  });
});

it("rejects multipass-only suite flags on the host runner", async () => {
  await expect(
    runQaSuiteCommand({

@@ -64,6 +64,11 @@ describe("buildQaRuntimeEnv", () => {
    expect(env.GEMINI_API_KEY).toBe("gemini-live");
  });

  it("defaults gateway-child provider mode to mock-openai when omitted", () => {
    expect(__testing.resolveQaGatewayChildProviderMode(undefined)).toBe("mock-openai");
    expect(__testing.resolveQaGatewayChildProviderMode("live-frontier")).toBe("live-frontier");
  });

  it("keeps explicit provider env vars over live aliases", () => {
    const env = buildQaRuntimeEnv({
      ...createParams({
@@ -299,6 +304,88 @@ describe("buildQaRuntimeEnv", () => {
    });
  });

  it("stages placeholder mock auth profiles per agent dir so mock-openai runs can resolve credentials", async () => {
    const stateDir = await mkdtemp(path.join(os.tmpdir(), "qa-mock-auth-"));
    cleanups.push(async () => {
      await rm(stateDir, { recursive: true, force: true });
    });

    const cfg = await __testing.stageQaMockAuthProfiles({
      cfg: {},
      stateDir,
    });

    // Config side: both providers should have a profile entry with mode
    // "api_key" so the runtime picks up the staging without any further
    // config mutation.
    expect(cfg.auth?.profiles?.["qa-mock-openai"]).toMatchObject({
      provider: "openai",
      mode: "api_key",
      displayName: "QA mock openai credential",
    });
    expect(cfg.auth?.profiles?.["qa-mock-anthropic"]).toMatchObject({
      provider: "anthropic",
      mode: "api_key",
      displayName: "QA mock anthropic credential",
    });

    // Store side: each agent dir should have its own auth-profiles.json
    // containing the placeholder credential for each staged provider. This
    // is what the scenario runner actually reads when it resolves auth
    // before calling the mock.
    for (const agentId of ["main", "qa"]) {
      const storeRaw = await readFile(
        path.join(stateDir, "agents", agentId, "agent", "auth-profiles.json"),
        "utf8",
      );
      const parsed = JSON.parse(storeRaw) as {
        profiles: Record<string, { type: string; provider: string; key: string }>;
      };
      expect(parsed.profiles["qa-mock-openai"]).toMatchObject({
        type: "api_key",
        provider: "openai",
        key: "qa-mock-not-a-real-key",
      });
      expect(parsed.profiles["qa-mock-anthropic"]).toMatchObject({
        type: "api_key",
        provider: "anthropic",
        key: "qa-mock-not-a-real-key",
      });
    }
  });

  it("stages mock profiles only for the requested agents and providers when callers override the defaults", async () => {
    const stateDir = await mkdtemp(path.join(os.tmpdir(), "qa-mock-auth-override-"));
    cleanups.push(async () => {
      await rm(stateDir, { recursive: true, force: true });
    });

    const cfg = await __testing.stageQaMockAuthProfiles({
      cfg: {},
      stateDir,
      agentIds: ["qa"],
      providers: ["openai"],
    });

    expect(cfg.auth?.profiles?.["qa-mock-openai"]).toMatchObject({
      provider: "openai",
      mode: "api_key",
    });
    // Anthropic should NOT be staged when the caller restricts providers.
    expect(cfg.auth?.profiles?.["qa-mock-anthropic"]).toBeUndefined();

    const qaStore = JSON.parse(
      await readFile(path.join(stateDir, "agents", "qa", "agent", "auth-profiles.json"), "utf8"),
    ) as { profiles: Record<string, unknown> };
    expect(qaStore.profiles["qa-mock-openai"]).toBeDefined();
    expect(qaStore.profiles["qa-mock-anthropic"]).toBeUndefined();

    // main/agent should not exist because it wasn't in the agentIds list.
    await expect(
      readFile(path.join(stateDir, "agents", "main", "agent", "auth-profiles.json"), "utf8"),
    ).rejects.toThrow(/ENOENT/);
  });

  it("allows loopback gateway health probes through the SSRF guard", async () => {
    const release = vi.fn(async () => {});
    fetchWithSsrFGuardMock.mockResolvedValue({

@@ -222,6 +222,12 @@ export function normalizeQaProviderModeEnv(
  return env;
}

export function resolveQaGatewayChildProviderMode(
  providerMode?: "mock-openai" | "live-frontier",
): "mock-openai" | "live-frontier" {
  return providerMode ?? "mock-openai";
}

function resolveQaLiveCliAuthEnv(
  baseEnv: NodeJS.ProcessEnv,
  opts?: {
@@ -395,6 +401,72 @@ export async function stageQaLiveAnthropicSetupToken(params: {
  });
}

/** Providers the mock-openai harness stages placeholder credentials for. */
export const QA_MOCK_AUTH_PROVIDERS = Object.freeze(["openai", "anthropic"] as const);

/** Agent IDs the mock-openai harness stages credentials under. */
export const QA_MOCK_AUTH_AGENT_IDS = Object.freeze(["main", "qa"] as const);

export function buildQaMockProfileId(provider: string): string {
  return `qa-mock-${provider}`;
}

/**
 * In mock-openai mode the qa suite runs against the embedded mock server
 * instead of a real provider API. The mock does not validate credentials, but
 * the agent auth layer still needs a matching `api_key` auth profile in
 * `auth-profiles.json` before it will route the request through
 * `providerBaseUrl`. Without this staging step, every scenario fails with
 * `FailoverError: No API key found for provider "openai"` before the mock
 * server ever sees a request.
 *
 * Stages a placeholder `api_key` profile per provider in each of the agent
 * dirs the qa suite uses (`main` for the runtime config, `qa` for scenario
 * runs) and returns a config with matching `auth.profiles` entries so the
 * runtime accepts the profile on the first lookup.
 *
 * The placeholder value `qa-mock-not-a-real-key` is intentionally not
 * shaped like a real API key (no `sk-` prefix that would trip secret
 * scanners). It only needs to be non-empty to pass the credential
 * serializer; anything beyond that is ignored by the mock.
 */
export async function stageQaMockAuthProfiles(params: {
  cfg: OpenClawConfig;
  stateDir: string;
  agentIds?: readonly string[];
  providers?: readonly string[];
}): Promise<OpenClawConfig> {
  const agentIds = [...new Set(params.agentIds ?? QA_MOCK_AUTH_AGENT_IDS)];
  const providers = [...new Set(params.providers ?? QA_MOCK_AUTH_PROVIDERS)];
  let next = params.cfg;
  for (const agentId of agentIds) {
    const agentDir = path.join(params.stateDir, "agents", agentId, "agent");
    await fs.mkdir(agentDir, { recursive: true });
    for (const provider of providers) {
      const profileId = buildQaMockProfileId(provider);
      upsertAuthProfile({
        profileId,
        credential: {
          type: "api_key",
          provider,
          key: "qa-mock-not-a-real-key",
          displayName: `QA mock ${provider} credential`,
        },
        agentDir,
      });
    }
  }
  for (const provider of providers) {
    next = applyAuthProfileConfig(next, {
      profileId: buildQaMockProfileId(provider),
      provider,
      mode: "api_key",
      displayName: `QA mock ${provider} credential`,
    });
  }
  return next;
}
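For reference, a minimal sketch of what the staging step leaves on disk. The `stageMockProfile` helper below is a hypothetical stand-in for the real `upsertAuthProfile`; the profile ids, `type`, and placeholder key mirror the values asserted in the tests above.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Illustrative stand-in (NOT the real helper): writes one placeholder
// credential into <agentDir>/auth-profiles.json, merging with any
// profiles already staged there.
function stageMockProfile(agentDir: string, provider: string): void {
  fs.mkdirSync(agentDir, { recursive: true });
  const storePath = path.join(agentDir, "auth-profiles.json");
  const store = fs.existsSync(storePath)
    ? (JSON.parse(fs.readFileSync(storePath, "utf8")) as { profiles: Record<string, unknown> })
    : { profiles: {} as Record<string, unknown> };
  store.profiles[`qa-mock-${provider}`] = {
    type: "api_key",
    provider,
    key: "qa-mock-not-a-real-key",
  };
  fs.writeFileSync(storePath, JSON.stringify(store, null, 2));
}

// Default staging: both providers under both agent dirs.
const stateDir = fs.mkdtempSync(path.join(os.tmpdir(), "qa-mock-auth-demo-"));
for (const agentId of ["main", "qa"]) {
  for (const provider of ["openai", "anthropic"]) {
    stageMockProfile(path.join(stateDir, "agents", agentId, "agent"), provider);
  }
}

const qaStore = JSON.parse(
  fs.readFileSync(path.join(stateDir, "agents", "qa", "agent", "auth-profiles.json"), "utf8"),
) as { profiles: Record<string, { provider: string }> };
console.log(Object.keys(qaStore.profiles)); // ["qa-mock-openai", "qa-mock-anthropic"]
```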

function isRetryableGatewayCallError(details: string): boolean {
  return (
    details.includes("handshake timeout") ||
@@ -440,8 +512,10 @@ export const __testing = {
  preserveQaGatewayDebugArtifacts,
  redactQaGatewayDebugText,
  readQaLiveProviderConfigOverrides,
  resolveQaGatewayChildProviderMode,
  resolveQaLiveAnthropicSetupToken,
  stageQaLiveAnthropicSetupToken,
  stageQaMockAuthProfiles,
  resolveQaLiveCliAuthEnv,
  resolveQaOwnerPluginIdsForProviderIds,
  resolveQaBundledPluginsSourceRoot,
@@ -868,8 +942,9 @@ export async function startQaGatewayChild(params: {
    fs.mkdir(xdgDataHome, { recursive: true }),
    fs.mkdir(xdgCacheHome, { recursive: true }),
  ]);
  const providerMode = resolveQaGatewayChildProviderMode(params.providerMode);
  const liveProviderIds =
-   params.providerMode === "live-frontier"
+   providerMode === "live-frontier"
      ? [params.primaryModel, params.alternateModel]
          .map((modelRef) =>
            typeof modelRef === "string" ? splitQaModelRef(modelRef)?.provider : undefined,
@@ -902,7 +977,7 @@ export async function startQaGatewayChild(params: {
      controlUiEnabled: params.controlUiEnabled,
    }),
    controlUiAllowedOrigins: params.controlUiAllowedOrigins,
-   providerMode: params.providerMode,
+   providerMode,
    primaryModel: params.primaryModel,
    alternateModel: params.alternateModel,
    enabledPluginIds,
@@ -921,6 +996,12 @@ export async function startQaGatewayChild(params: {
      cfg,
      stateDir,
    });
    if (providerMode === "mock-openai") {
      cfg = await stageQaMockAuthProfiles({
        cfg,
        stateDir,
      });
    }
    return params.mutateConfig ? params.mutateConfig(cfg) : cfg;
  };
  const stdout: Buffer[] = [];
@@ -981,7 +1062,7 @@ export async function startQaGatewayChild(params: {
    xdgCacheHome,
    bundledPluginsDir,
    compatibilityHostVersion: runtimeHostVersion,
-   providerMode: params.providerMode,
+   providerMode,
    forwardHostHomeForClaudeCli: liveProviderIds.includes("claude-cli"),
    claudeCliAuthMode: params.claudeCliAuthMode,
  });

File diff suppressed because it is too large
@@ -22,6 +22,58 @@ type StreamEvent =
    };
  };

/**
 * Provider variant tag for `body.model`. The mock previously ignored
 * `body.model` for dispatch and only echoed it in the prose output, which
 * made the parity gate tautological when run against the mock alone
 * (both providers produced identical scenario plans by construction).
 * Tagging requests with a normalized variant lets individual scenario
 * branches opt into provider-specific behavior while the rest of the
 * dispatcher stays shared, and lets `/debug/requests` consumers verify
 * which provider lane a given request came from without re-parsing the
 * raw model string.
 *
 * Policy:
 * - `openai/*`, `openai-codex/*`, or bare `gpt-*` / `o1-*` / `openai-*` names → `"openai"`
 * - `anthropic/*`, `claude-cli/*`, or bare `claude-*` / `anthropic-*` names → `"anthropic"`
 * - Everything else (including empty strings) → `"unknown"`
 *
 * The `/v1/messages` route always feeds `body.model` straight through,
 * so an Anthropic request with an `openai/gpt-5.4` model string is still
 * classified as `"openai"`. That matches the parity program's convention
 * where the provider label is the source of truth, not the HTTP route.
 */
export type MockOpenAiProviderVariant = "openai" | "anthropic" | "unknown";

export function resolveProviderVariant(model: string | undefined): MockOpenAiProviderVariant {
  if (typeof model !== "string") {
    return "unknown";
  }
  const trimmed = model.trim().toLowerCase();
  if (trimmed.length === 0) {
    return "unknown";
  }
  // Prefer the explicit `provider/model` or `provider:model` prefix when
  // the caller supplied one — that's the most reliable signal.
  const separatorMatch = /^([^/:]+)[/:]/.exec(trimmed);
  const provider = separatorMatch?.[1] ?? trimmed;
  if (provider === "openai" || provider === "openai-codex") {
    return "openai";
  }
  if (provider === "anthropic" || provider === "claude-cli") {
    return "anthropic";
  }
  // Fall back to model-name prefix matching for bare model strings like
  // `gpt-5.4` or `claude-opus-4-6`.
  if (/^(?:gpt-|o1-|openai-)/.test(trimmed)) {
    return "openai";
  }
  if (/^(?:claude-|anthropic-)/.test(trimmed)) {
    return "anthropic";
  }
  return "unknown";
}
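The policy can be spot-checked in isolation. The block below reproduces `resolveProviderVariant` from this diff verbatim and exercises each classification branch:

```typescript
// Reproduced from the diff above so the policy bullets can be checked directly.
type MockOpenAiProviderVariant = "openai" | "anthropic" | "unknown";

function resolveProviderVariant(model: string | undefined): MockOpenAiProviderVariant {
  if (typeof model !== "string") {
    return "unknown";
  }
  const trimmed = model.trim().toLowerCase();
  if (trimmed.length === 0) {
    return "unknown";
  }
  // Explicit `provider/model` or `provider:model` prefix wins.
  const separatorMatch = /^([^/:]+)[/:]/.exec(trimmed);
  const provider = separatorMatch?.[1] ?? trimmed;
  if (provider === "openai" || provider === "openai-codex") {
    return "openai";
  }
  if (provider === "anthropic" || provider === "claude-cli") {
    return "anthropic";
  }
  // Bare model-name prefixes for strings like `gpt-5.4` or `claude-opus-4-6`.
  if (/^(?:gpt-|o1-|openai-)/.test(trimmed)) {
    return "openai";
  }
  if (/^(?:claude-|anthropic-)/.test(trimmed)) {
    return "anthropic";
  }
  return "unknown";
}

console.log(resolveProviderVariant("openai/gpt-5.4")); // "openai"
console.log(resolveProviderVariant("claude-opus-4-6")); // "anthropic"
console.log(resolveProviderVariant("mistral-large")); // "unknown"
```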

type MockOpenAiRequestSnapshot = {
  raw: string;
  body: Record<string, unknown>;
@@ -30,13 +82,52 @@ type MockOpenAiRequestSnapshot = {
  instructions?: string;
  toolOutput: string;
  model: string;
  providerVariant: MockOpenAiProviderVariant;
  imageInputCount: number;
  plannedToolName?: string;
};

// Anthropic /v1/messages request/response shapes the mock actually needs.
// This is a subset of the real Anthropic Messages API — just enough so the
// QA suite can run its parity pack against a "baseline" Anthropic provider
// without needing real API keys. The scenarios drive their dispatch through
// the shared mock scenario logic (buildResponsesPayload), so whatever
// behavior the OpenAI mock exposes is automatically mirrored on this route.
type AnthropicMessageContentBlock =
  | { type: "text"; text: string }
  | {
      type: "tool_use";
      id: string;
      name: string;
      input: Record<string, unknown>;
    }
  | {
      type: "tool_result";
      tool_use_id: string;
      content: string | Array<{ type: "text"; text: string }>;
    }
  | { type: "image"; source: Record<string, unknown> };

type AnthropicMessage = {
  role: "user" | "assistant";
  content: string | AnthropicMessageContentBlock[];
};

type AnthropicMessagesRequest = {
  model?: string;
  max_tokens?: number;
  system?: string | Array<{ type: "text"; text: string }>;
  messages?: AnthropicMessage[];
  tools?: Array<Record<string, unknown>>;
  stream?: boolean;
};

const TINY_PNG_BASE64 =
  "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8/x8AAwMCAO7Z0nQAAAAASUVORK5CYII=";
let subagentFanoutPhase = 0;

type MockScenarioState = {
  subagentFanoutPhase: number;
};

function readBody(req: IncomingMessage): Promise<string> {
  return new Promise((resolve, reject) => {
@@ -68,6 +159,23 @@ function writeSse(res: ServerResponse, events: StreamEvent[]) {
  res.end(body);
}

type AnthropicStreamEvent = Record<string, unknown> & {
  type: string;
};

function writeAnthropicSse(res: ServerResponse, events: AnthropicStreamEvent[]) {
  const body = events
    .map((event) => `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`)
    .join("");
  res.writeHead(200, {
    "content-type": "text/event-stream",
    "cache-control": "no-store",
    connection: "keep-alive",
    "content-length": Buffer.byteLength(body),
  });
  res.end(body);
}
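Each Anthropic stream event is framed with an explicit `event:` line naming its type, unlike the OpenAI writer. A minimal sketch of the serialization, using the same map expression as `writeAnthropicSse` without the `ServerResponse` plumbing:

```typescript
type AnthropicStreamEvent = Record<string, unknown> & { type: string };

// Same serialization as writeAnthropicSse above, minus the HTTP headers
// and response object: one "event:" line per event, JSON payload on the
// "data:" line, blank line as the frame terminator.
function serializeAnthropicSse(events: AnthropicStreamEvent[]): string {
  return events
    .map((event) => `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`)
    .join("");
}

const body = serializeAnthropicSse([
  { type: "message_start", message: { id: "msg_demo" } },
  { type: "message_stop" },
]);
console.log(body.startsWith("event: message_start\n")); // true
```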

function countApproxTokens(text: string) {
  const trimmed = text.trim();
  if (!trimmed) {
@@ -376,11 +484,11 @@ function extractLastCapture(text: string, pattern: RegExp)
}

function extractExactReplyDirective(text: string) {
- const colonMatch = extractLastCapture(text, /reply(?: with)? exactly:\s*([^\n]+)/i);
- if (colonMatch) {
-   return colonMatch;
+ const backtickedMatch = extractLastCapture(text, /reply(?: with)? exactly\s+`([^`]+)`/i);
+ if (backtickedMatch) {
+   return backtickedMatch;
  }
- return extractLastCapture(text, /reply(?: with)? exactly\s+`([^`]+)`/i);
+ return extractLastCapture(text, /reply(?: with)? exactly:\s*([^\n]+)/i);
}

function extractExactMarkerDirective(text: string) {
@@ -392,10 +500,18 @@ function extractExactMarkerDirective(text: string) {
}

function isHeartbeatPrompt(text: string) {
- return /Read HEARTBEAT\.md if it exists/i.test(text);
+ const trimmed = text.trim();
+ if (!trimmed || /remember this fact/i.test(trimmed)) {
+   return false;
+ }
+ return /(?:^|\n)Read HEARTBEAT\.md if it exists\b/i.test(trimmed);
}
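The tightened guard can be exercised standalone; this reproduces the new `isHeartbeatPrompt` body from the hunk above:

```typescript
// New isHeartbeatPrompt body, reproduced from the diff: anchored match on
// the trimmed input, with an explicit opt-out for memory-recall prompts.
function isHeartbeatPrompt(text: string): boolean {
  const trimmed = text.trim();
  if (!trimmed || /remember this fact/i.test(trimmed)) {
    return false;
  }
  return /(?:^|\n)Read HEARTBEAT\.md if it exists\b/i.test(trimmed);
}

console.log(isHeartbeatPrompt("Read HEARTBEAT.md if it exists, then reply HEARTBEAT_OK.")); // true
console.log(isHeartbeatPrompt("Remember this fact: Read HEARTBEAT.md if it exists.")); // false
console.log(isHeartbeatPrompt("Please also Read HEARTBEAT.md if it exists.")); // false
```

The old one-line version matched the heartbeat phrase anywhere in the text, so a recall prompt that merely quoted it would have been answered with `HEARTBEAT_OK`.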

- function buildAssistantText(input: ResponsesInputItem[], body: Record<string, unknown>) {
+ function buildAssistantText(
+   input: ResponsesInputItem[],
+   body: Record<string, unknown>,
+   scenarioState: MockScenarioState,
+ ) {
  const prompt = extractLastUserText(input);
  const toolOutput = extractToolOutput(input);
  const toolJson = parseToolOutputJson(toolOutput);
@@ -411,8 +527,10 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
      : toolOutput;
  const orbitCode = extractOrbitCode(memorySnippet);
  const mediaPath = /MEDIA:([^\n]+)/.exec(toolOutput)?.[1]?.trim();
- const exactReplyDirective = extractExactReplyDirective(allInputText);
- const exactMarkerDirective = extractExactMarkerDirective(allInputText);
+ const exactReplyDirective =
+   extractExactReplyDirective(prompt) ?? extractExactReplyDirective(allInputText);
+ const exactMarkerDirective =
+   extractExactMarkerDirective(prompt) ?? extractExactMarkerDirective(allInputText);
  const imageInputCount = countImageInputs(input);
  const activeMemorySummary = extractActiveMemorySummary(allInputText);
  const snackPreference = extractSnackPreference(activeMemorySummary ?? memorySnippet);
@@ -456,6 +574,23 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
  if (/tool continuity check/i.test(prompt) && toolOutput) {
    return `Protocol note: model switch handoff confirmed on ${model || "the requested model"}. QA mission from QA_KICKOFF_TASK.md still applies: understand this OpenClaw repo from source + docs before acting.`;
  }
  if (toolOutput && /repo contract followthrough check/i.test(prompt)) {
    if (
      /successfully (?:wrote|created|updated|replaced)/i.test(toolOutput) ||
      /status:\s*complete/i.test(toolOutput)
    ) {
      return [
        "Read: AGENT.md, SOUL.md, FOLLOWTHROUGH_INPUT.md",
        "Wrote: repo-contract-summary.txt",
        "Status: complete",
      ].join("\n");
    }
    return [
      "Read: AGENT.md, SOUL.md, FOLLOWTHROUGH_INPUT.md",
      "Wrote: repo-contract-summary.txt",
      "Status: blocked",
    ].join("\n");
  }
  if (/session memory ranking check/i.test(prompt) && orbitCode) {
    return `Protocol note: I checked memory and the current Project Nebula codename is ${orbitCode}.`;
  }
@@ -489,7 +624,11 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
  if (/fanout worker beta/i.test(prompt)) {
    return "BETA-OK";
  }
- if (/subagent fanout synthesis check/i.test(prompt) && toolOutput && subagentFanoutPhase >= 2) {
+ if (
+   /subagent fanout synthesis check/i.test(prompt) &&
+   toolOutput &&
+   scenarioState.subagentFanoutPhase >= 2
+ ) {
    return "Protocol note: delegated fanout complete. Alpha=ALPHA-OK. Beta=BETA-OK.";
  }
  if (toolOutput && (/\bdelegate\b/i.test(prompt) || /subagent handoff/i.test(prompt))) {
@@ -579,7 +718,10 @@ function buildAssistantEvents(text: string): StreamEvent[] {
  ];
}

- async function buildResponsesPayload(body: Record<string, unknown>) {
+ async function buildResponsesPayload(
+   body: Record<string, unknown>,
+   scenarioState: MockScenarioState,
+ ) {
  const input = Array.isArray(body.input) ? (body.input as ResponsesInputItem[]) : [];
  const prompt = extractLastUserText(input);
  const toolOutput = extractToolOutput(input);
@@ -587,6 +729,9 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
  const allInputText = extractAllRequestTexts(input, body);
  const isGroupChat = allInputText.includes('"is_group_chat": true');
  const isBaselineUnmentionedChannelChatter = /\bno bot ping here\b/i.test(prompt);
  if (/remember this fact/i.test(prompt)) {
    return buildAssistantEvents(buildAssistantText(input, body, scenarioState));
  }
  if (isHeartbeatPrompt(prompt)) {
    return buildAssistantEvents("HEARTBEAT_OK");
  }
@@ -756,16 +901,16 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
    });
  }
  if (/subagent fanout synthesis check/i.test(prompt)) {
-   if (!toolOutput && subagentFanoutPhase === 0) {
-     subagentFanoutPhase = 1;
+   if (!toolOutput && scenarioState.subagentFanoutPhase === 0) {
+     scenarioState.subagentFanoutPhase = 1;
      return buildToolCallEventsWithArgs("sessions_spawn", {
        task: "Fanout worker alpha: inspect the QA workspace and finish with exactly ALPHA-OK.",
        label: "qa-fanout-alpha",
        thread: false,
      });
    }
-   if (toolOutput && subagentFanoutPhase === 1) {
-     subagentFanoutPhase = 2;
+   if (toolOutput && scenarioState.subagentFanoutPhase === 1) {
+     scenarioState.subagentFanoutPhase = 2;
      return buildToolCallEventsWithArgs("sessions_spawn", {
        task: "Fanout worker beta: inspect the QA workspace and finish with exactly BETA-OK.",
        label: "qa-fanout-beta",
@@ -776,6 +921,30 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
  if (/tool continuity check/i.test(prompt) && !toolOutput) {
    return buildToolCallEventsWithArgs("read", { path: "QA_KICKOFF_TASK.md" });
  }
  if (/repo contract followthrough check/i.test(prompt)) {
    if (!toolOutput) {
      return buildToolCallEventsWithArgs("read", { path: "AGENT.md" });
    }
    if (toolOutput.includes("# Repo contract")) {
      return buildToolCallEventsWithArgs("read", { path: "SOUL.md" });
    }
    if (toolOutput.includes("# Execution style")) {
      return buildToolCallEventsWithArgs("read", { path: "FOLLOWTHROUGH_INPUT.md" });
    }
    if (
      toolOutput.includes("Mission: prove you followed the repo contract.") &&
      toolOutput.includes("Evidence path: AGENT.md -> SOUL.md -> FOLLOWTHROUGH_INPUT.md")
    ) {
      return buildToolCallEventsWithArgs("write", {
        path: "repo-contract-summary.txt",
        content: [
          "Mission: prove you followed the repo contract.",
          "Evidence: AGENT.md -> SOUL.md -> FOLLOWTHROUGH_INPUT.md",
          "Status: complete",
        ].join("\n"),
      });
    }
  }
  if ((/\bdelegate\b/i.test(prompt) || /subagent handoff/i.test(prompt)) && !toolOutput) {
    return buildToolCallEventsWithArgs("sessions_spawn", {
      task: "Inspect the QA workspace and return one concise protocol note.",
@@ -807,12 +976,390 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
  ) {
    await sleep(60_000);
  }
- return buildAssistantEvents(buildAssistantText(input, body));
+ return buildAssistantEvents(buildAssistantText(input, body, scenarioState));
}

// ---------------------------------------------------------------------------
// Anthropic /v1/messages adapter
// ---------------------------------------------------------------------------
//
// The QA parity gate needs two comparable scenario runs: one against the
// "candidate" (openai/gpt-5.4) and one against the "baseline"
// (anthropic/claude-opus-4-6). The OpenAI mock above already dispatches all
// the scenario prompt branches we care about. Rather than duplicating that
// machinery, the /v1/messages route below translates Anthropic request
// shapes into the shared ResponsesInputItem[] format, calls the same
// buildResponsesPayload() dispatcher, and then re-serializes the resulting
// events into an Anthropic response. This gives the parity harness a
// baseline lane that exercises the same scenario logic without requiring
// real Anthropic API keys.
//
// Scope: handles Anthropic Messages requests with text and tool_result
// content blocks, supporting both non-streaming JSON responses and the
// streaming SSE path used by the parity harness.

function normalizeAnthropicSystemToString(
  system: AnthropicMessagesRequest["system"],
): string | undefined {
  if (typeof system === "string") {
    return system.trim() || undefined;
  }
  if (Array.isArray(system)) {
    const joined = system
      .map((block) => (block?.type === "text" ? block.text : ""))
      .filter(Boolean)
      .join("\n")
      .trim();
    return joined || undefined;
  }
  return undefined;
}

function stringifyToolResultContent(
  content: Extract<AnthropicMessageContentBlock, { type: "tool_result" }>["content"],
): string {
  if (typeof content === "string") {
    return content;
  }
  if (Array.isArray(content)) {
    return content
      .map((block) => (block?.type === "text" ? block.text : ""))
      .filter(Boolean)
      .join("\n");
  }
  return "";
}

function convertAnthropicMessagesToResponsesInput(params: {
  system?: AnthropicMessagesRequest["system"];
  messages: AnthropicMessage[];
}): ResponsesInputItem[] {
  const items: ResponsesInputItem[] = [];
  const systemText = normalizeAnthropicSystemToString(params.system);
  if (systemText) {
    items.push({
      role: "system",
      content: [{ type: "input_text", text: systemText }],
    });
  }
  for (const message of params.messages) {
    const content = message.content;
    if (typeof content === "string") {
      items.push({
        role: message.role,
        content: [
          message.role === "assistant"
            ? { type: "output_text", text: content }
            : { type: "input_text", text: content },
        ],
      });
      continue;
    }
    if (!Array.isArray(content)) {
      continue;
    }
    // Buffer each block type so we can push in OpenAI-Responses order instead
    // of the order they appear in the Anthropic content array. The parent
    // role message must precede any function_call_output items from the same
    // turn, otherwise extractToolOutput() (which scans for
    // function_call_output AFTER the last user-role index) will not see the
    // output and the downstream scenario dispatcher will behave as if no
    // tool output was returned. Similarly, assistant tool_use blocks become
    // function_call items that must follow the assistant text message they
    // narrate.
    const textPieces: Array<{ type: "input_text" | "output_text"; text: string }> = [];
    const imagePieces: Array<{ type: "input_image"; image_url: string }> = [];
    const toolResultItems: ResponsesInputItem[] = [];
    const toolUseItems: ResponsesInputItem[] = [];
    for (const block of content) {
      if (!block || typeof block !== "object") {
        continue;
      }
      if (block.type === "text") {
        textPieces.push({
          type: message.role === "assistant" ? "output_text" : "input_text",
          text: block.text ?? "",
        });
        continue;
      }
      if (block.type === "image") {
        // Mock only needs to count image inputs; a placeholder URL is fine.
        imagePieces.push({ type: "input_image", image_url: "anthropic-mock:image" });
        continue;
      }
      if (block.type === "tool_result") {
        const output = stringifyToolResultContent(block.content);
        if (output.trim()) {
          toolResultItems.push({ type: "function_call_output", output });
        }
        continue;
      }
      if (block.type === "tool_use") {
        // Mirror OpenAI's function_call output_item shape so downstream
        // prompt extraction still sees "the assistant just emitted a tool
        // call". The scenario dispatcher looks for tool_output on the next
        // user turn, not the assistant's prior tool_use, so a minimal
        // placeholder is enough.
        toolUseItems.push({
          type: "function_call",
          name: block.name,
          arguments: JSON.stringify(block.input ?? {}),
          call_id: block.id,
        });
        continue;
      }
    }
    if (textPieces.length > 0 || imagePieces.length > 0) {
      const combinedContent: Array<Record<string, unknown>> = [...textPieces, ...imagePieces];
      items.push({ role: message.role, content: combinedContent });
    }
    // Emit tool_use (assistant prior calls) and tool_result (user-side
    // returns) AFTER the parent role message so extractLastUserText and
    // extractToolOutput walk the array in the order they expect. For a
    // tool_result-only user turn with no text/image blocks, the parent
    // message is intentionally omitted — the function_call_output itself
    // represents the user's "return the tool output" turn.
    for (const toolUse of toolUseItems) {
      items.push(toolUse);
    }
    for (const toolResult of toolResultItems) {
      items.push(toolResult);
    }
  }
  return items;
}
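The ordering contract described in the comments (parent role message first, then tool items) can be sketched with a pared-down converter. The types below are simplified stand-ins for the real `ResponsesInputItem` union, not the actual shapes:

```typescript
// Simplified stand-ins; only the fields the ordering demo needs.
type DemoBlock =
  | { type: "text"; text: string }
  | { type: "tool_result"; output: string };
type DemoItem =
  | { role: "user"; content: string }
  | { type: "function_call_output"; output: string };

// Mirrors the buffering strategy above: collect text and tool_result
// blocks separately, then emit the role message before any
// function_call_output items, regardless of block order in the input.
function convertUserTurn(blocks: DemoBlock[]): DemoItem[] {
  const texts: string[] = [];
  const toolResults: DemoItem[] = [];
  for (const block of blocks) {
    if (block.type === "text") {
      texts.push(block.text);
    } else {
      toolResults.push({ type: "function_call_output", output: block.output });
    }
  }
  const items: DemoItem[] = [];
  if (texts.length > 0) {
    items.push({ role: "user", content: texts.join("\n") });
  }
  return [...items, ...toolResults];
}

// tool_result arrives FIRST in the Anthropic content array, but the
// converted items still put the user message ahead of it.
const items = convertUserTurn([
  { type: "tool_result", output: "ALPHA-OK" },
  { type: "text", text: "subagent fanout synthesis check" },
]);
console.log(items.map((item) => ("role" in item ? "user" : item.type)));
```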
|
||||
|
||||
type ExtractedAssistantOutput = {
  text: string;
  toolCalls: Array<{ id: string; name: string; input: Record<string, unknown> }>;
};

function extractFinalAssistantOutputFromEvents(events: StreamEvent[]): ExtractedAssistantOutput {
  const toolCalls: ExtractedAssistantOutput["toolCalls"] = [];
  let text = "";
  for (const event of events) {
    if (event.type !== "response.output_item.done") {
      continue;
    }
    const item = event.item as {
      type?: unknown;
      name?: unknown;
      call_id?: unknown;
      id?: unknown;
      arguments?: unknown;
      content?: unknown;
    };
    if (item.type === "function_call" && typeof item.name === "string") {
      let input: Record<string, unknown> = {};
      if (typeof item.arguments === "string" && item.arguments.trim()) {
        try {
          const parsed = JSON.parse(item.arguments) as unknown;
          if (parsed && typeof parsed === "object" && !Array.isArray(parsed)) {
            input = parsed as Record<string, unknown>;
          }
        } catch {
          // keep empty input on malformed args — mock dispatcher owns arg shape
        }
      }
      toolCalls.push({
        id: typeof item.call_id === "string" ? item.call_id : `toolu_mock_${toolCalls.length + 1}`,
        name: item.name,
        input,
      });
      continue;
    }
    if (item.type === "message" && Array.isArray(item.content)) {
      for (const piece of item.content as Array<{ type?: unknown; text?: unknown }>) {
        if (piece?.type === "output_text" && typeof piece.text === "string") {
          text = piece.text;
        }
      }
    }
  }
  return { text, toolCalls };
}
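The extraction rules above (only `response.output_item.done` events count, malformed `arguments` JSON degrades to an empty input, and the last `output_text` block wins) can be exercised in isolation. This is a minimal standalone sketch, not the mock server's own code: `SketchEvent` narrows the event shape to the two fields the logic reads.

```typescript
// Standalone sketch of the extraction rule: function_call items become
// tool calls, the LAST output_text block becomes the final text.
type SketchEvent = { type: string; item?: Record<string, unknown> };

function sketchExtract(events: SketchEvent[]) {
  const toolCalls: Array<{ name: string; input: Record<string, unknown> }> = [];
  let text = "";
  for (const event of events) {
    if (event.type !== "response.output_item.done" || !event.item) continue;
    const item = event.item;
    if (item.type === "function_call" && typeof item.name === "string") {
      let input: Record<string, unknown> = {};
      try {
        const parsed = JSON.parse(String(item.arguments ?? "")) as unknown;
        if (parsed && typeof parsed === "object" && !Array.isArray(parsed)) {
          input = parsed as Record<string, unknown>;
        }
      } catch {
        // malformed args → keep empty input, matching the mock server
      }
      toolCalls.push({ name: item.name, input });
    } else if (item.type === "message" && Array.isArray(item.content)) {
      for (const piece of item.content as Array<{ type?: string; text?: string }>) {
        if (piece?.type === "output_text" && typeof piece.text === "string") text = piece.text;
      }
    }
  }
  return { text, toolCalls };
}

const out = sketchExtract([
  { type: "response.output_item.done", item: { type: "function_call", name: "read", arguments: '{"path":"AGENT.md"}' } },
  { type: "response.output_item.done", item: { type: "message", content: [{ type: "output_text", text: "done" }] } },
]);
console.log(out.text, out.toolCalls[0]?.name); // done read
```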

function buildAnthropicMessageResponse(params: {
  model: string;
  extracted: ExtractedAssistantOutput;
}): Record<string, unknown> {
  const content: Array<Record<string, unknown>> = [];
  if (params.extracted.text) {
    content.push({ type: "text", text: params.extracted.text });
  }
  for (const call of params.extracted.toolCalls) {
    content.push({
      type: "tool_use",
      id: call.id,
      name: call.name,
      input: call.input,
    });
  }
  if (content.length === 0) {
    content.push({ type: "text", text: "" });
  }
  const stopReason = params.extracted.toolCalls.length > 0 ? "tool_use" : "end_turn";
  const approxInputTokens = 64;
  const approxOutputTokens = Math.max(
    16,
    countApproxTokens(params.extracted.text) + params.extracted.toolCalls.length * 16,
  );
  return {
    id: `msg_mock_${Math.floor(Math.random() * 1_000_000).toString(16)}`,
    type: "message",
    role: "assistant",
    model: params.model || "claude-opus-4-6",
    content,
    stop_reason: stopReason,
    stop_sequence: null,
    usage: {
      input_tokens: approxInputTokens,
      output_tokens: approxOutputTokens,
    },
  };
}

function buildAnthropicMessageStreamEvents(params: {
  model: string;
  extracted: ExtractedAssistantOutput;
}): AnthropicStreamEvent[] {
  const approxInputTokens = 64;
  const approxOutputTokens = Math.max(
    16,
    countApproxTokens(params.extracted.text) + params.extracted.toolCalls.length * 16,
  );
  const messageId = `msg_mock_${Math.floor(Math.random() * 1_000_000).toString(16)}`;
  const events: AnthropicStreamEvent[] = [
    {
      type: "message_start",
      message: {
        id: messageId,
        type: "message",
        role: "assistant",
        model: params.model || "claude-opus-4-6",
        content: [],
        stop_reason: null,
        stop_sequence: null,
        usage: {
          input_tokens: approxInputTokens,
          output_tokens: 0,
        },
      },
    },
  ];
  let index = 0;
  if (params.extracted.text || params.extracted.toolCalls.length === 0) {
    events.push({
      type: "content_block_start",
      index,
      content_block: {
        type: "text",
        text: "",
      },
    });
    if (params.extracted.text) {
      events.push({
        type: "content_block_delta",
        index,
        delta: {
          type: "text_delta",
          text: params.extracted.text,
        },
      });
    }
    events.push({
      type: "content_block_stop",
      index,
    });
    index += 1;
  }
  for (const call of params.extracted.toolCalls) {
    events.push({
      type: "content_block_start",
      index,
      content_block: {
        type: "tool_use",
        id: call.id,
        name: call.name,
        input: {},
      },
    });
    events.push({
      type: "content_block_delta",
      index,
      delta: {
        type: "input_json_delta",
        partial_json: JSON.stringify(call.input ?? {}),
      },
    });
    events.push({
      type: "content_block_stop",
      index,
    });
    index += 1;
  }
  events.push({
    type: "message_delta",
    delta: {
      stop_reason: params.extracted.toolCalls.length > 0 ? "tool_use" : "end_turn",
    },
    usage: {
      input_tokens: approxInputTokens,
      output_tokens: approxOutputTokens,
    },
  });
  events.push({
    type: "message_stop",
  });
  return events;
}
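The builder above always emits a fixed envelope: `message_start`, then one `content_block_start`/`content_block_delta`/`content_block_stop` group per block with a strictly increasing `index`, then `message_delta`, then `message_stop`. A minimal order checker (a sketch over a narrowed `{ type, index }` event shape, not the server's own validation) makes that invariant explicit:

```typescript
// Checks the Anthropic stream envelope: message_start first,
// balanced content_block groups with increasing index, then
// message_delta and message_stop at the end.
type Ev = { type: string; index?: number };

function isWellOrderedAnthropicStream(events: Ev[]): boolean {
  if (events[0]?.type !== "message_start") return false;
  if (events.at(-1)?.type !== "message_stop") return false;
  if (events.at(-2)?.type !== "message_delta") return false;
  let open: number | null = null; // index of the currently open block
  let nextIndex = 0;
  for (const ev of events.slice(1, -2)) {
    if (ev.type === "content_block_start") {
      if (open !== null || ev.index !== nextIndex) return false;
      open = ev.index;
    } else if (ev.type === "content_block_delta") {
      if (ev.index !== open) return false;
    } else if (ev.type === "content_block_stop") {
      if (ev.index !== open) return false;
      open = null;
      nextIndex += 1;
    } else {
      return false;
    }
  }
  return open === null;
}

const ok = isWellOrderedAnthropicStream([
  { type: "message_start" },
  { type: "content_block_start", index: 0 },
  { type: "content_block_delta", index: 0 },
  { type: "content_block_stop", index: 0 },
  { type: "message_delta" },
  { type: "message_stop" },
]);
console.log(ok); // true
```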

async function buildMessagesPayload(
  body: AnthropicMessagesRequest,
  scenarioState: MockScenarioState,
): Promise<{
  events: StreamEvent[];
  input: ResponsesInputItem[];
  extracted: ExtractedAssistantOutput;
  responseBody: Record<string, unknown>;
  streamEvents: AnthropicStreamEvent[];
  model: string;
}> {
  const messages = Array.isArray(body.messages) ? body.messages : [];
  const input = convertAnthropicMessagesToResponsesInput({
    system: body.system,
    messages,
  });
  // Treat empty-string model the same as absent. A bare typeof check lets
  // `""` leak through to `responseBody.model` and `lastRequest.model`,
  // which then confuses parity consumers that assume the mock always
  // echoes the real provider label. Normalize once and reuse everywhere.
  const normalizedModel =
    typeof body.model === "string" && body.model.trim() !== "" ? body.model : "claude-opus-4-6";
  // Dispatch through the same scenario logic the /v1/responses route uses.
  // The mock dispatcher only reads `body.input`, `body.model`, and
  // `body.stream`, so a synthetic shim body is sufficient.
  const dispatchBody: Record<string, unknown> = {
    input,
    model: normalizedModel,
    stream: false,
  };
  const events = await buildResponsesPayload(dispatchBody, scenarioState);
  const extracted = extractFinalAssistantOutputFromEvents(events);
  const responseBody = buildAnthropicMessageResponse({
    model: normalizedModel,
    extracted,
  });
  const streamEvents = buildAnthropicMessageStreamEvents({
    model: normalizedModel,
    extracted,
  });
  return { events, input, extracted, responseBody, streamEvents, model: normalizedModel };
}

export async function startQaMockOpenAiServer(params?: { host?: string; port?: number }) {
  const host = params?.host ?? "127.0.0.1";
  subagentFanoutPhase = 0;
  const scenarioState: MockScenarioState = { subagentFanoutPhase: 0 };
  let lastRequest: MockOpenAiRequestSnapshot | null = null;
  const requests: MockOpenAiRequestSnapshot[] = [];
  const imageGenerationRequests: Array<Record<string, unknown>> = [];
@@ -829,6 +1376,8 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
          { id: "gpt-5.4-alt", object: "model" },
          { id: "gpt-image-1", object: "model" },
          { id: "text-embedding-3-small", object: "model" },
+         { id: "claude-opus-4-6", object: "model" },
+         { id: "claude-sonnet-4-6", object: "model" },
        ],
      });
      return;
@@ -888,7 +1437,8 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
      const raw = await readBody(req);
      const body = raw ? (JSON.parse(raw) as Record<string, unknown>) : {};
      const input = Array.isArray(body.input) ? (body.input as ResponsesInputItem[]) : [];
-     const events = await buildResponsesPayload(body);
+     const events = await buildResponsesPayload(body, scenarioState);
+     const resolvedModel = typeof body.model === "string" ? body.model : "";
      lastRequest = {
        raw,
        body,
@@ -896,7 +1446,8 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
        allInputText: extractAllRequestTexts(input, body),
        instructions: extractInstructionsText(body) || undefined,
        toolOutput: extractToolOutput(input),
-       model: typeof body.model === "string" ? body.model : "",
+       model: resolvedModel,
+       providerVariant: resolveProviderVariant(resolvedModel),
        imageInputCount: countImageInputs(input),
        plannedToolName: extractPlannedToolName(events),
      };
@@ -916,6 +1467,56 @@ export async function startQaMockOpenAiServer(params?: { host?: string; port?: n
        writeSse(res, events);
        return;
      }
      if (req.method === "POST" && url.pathname === "/v1/messages") {
        const raw = await readBody(req);
        let body: AnthropicMessagesRequest = {};
        try {
          body = raw ? (JSON.parse(raw) as AnthropicMessagesRequest) : {};
        } catch {
          writeJson(res, 400, {
            type: "error",
            error: {
              type: "invalid_request_error",
              message: "Malformed JSON body for Anthropic Messages request.",
            },
          });
          return;
        }
        const {
          events,
          input,
          responseBody,
          streamEvents,
          model: normalizedModel,
        } = await buildMessagesPayload(body, scenarioState);
        // Record the adapted request snapshot so /debug/requests gives the QA
        // suite the same plannedToolName / allInputText / toolOutput signals
        // on the Anthropic route that the OpenAI route already exposes. This
        // is what lets a single parity run diff assertions across both lanes.
        // Reuse the normalized model so an empty-string body.model no longer
        // leaks through to `lastRequest.model`.
        lastRequest = {
          raw,
          body: body as Record<string, unknown>,
          prompt: extractLastUserText(input),
          allInputText: extractAllInputTexts(input),
          toolOutput: extractToolOutput(input),
          model: normalizedModel,
          providerVariant: resolveProviderVariant(normalizedModel),
          imageInputCount: countImageInputs(input),
          plannedToolName: extractPlannedToolName(events),
        };
        requests.push(lastRequest);
        if (requests.length > 50) {
          requests.splice(0, requests.length - 50);
        }
        if (body.stream === true) {
          writeAnthropicSse(res, streamEvents);
          return;
        }
        writeJson(res, 200, responseBody);
        return;
      }
      writeJson(res, 404, { error: "not found" });
  });


@@ -53,6 +53,11 @@ describe("buildQaGatewayConfig", () => {

    expect(getPrimaryModel(cfg.agents?.defaults?.model)).toBe("mock-openai/gpt-5.4");
    expect(cfg.models?.providers?.["mock-openai"]?.baseUrl).toBe("http://127.0.0.1:44080/v1");
    expect(cfg.models?.providers?.["mock-openai"]?.request).toEqual({ allowPrivateNetwork: true });
    expect(cfg.models?.providers?.openai?.baseUrl).toBe("http://127.0.0.1:44080/v1");
    expect(cfg.models?.providers?.openai?.request).toEqual({ allowPrivateNetwork: true });
    expect(cfg.models?.providers?.anthropic?.baseUrl).toBe("http://127.0.0.1:44080");
    expect(cfg.models?.providers?.anthropic?.request).toEqual({ allowPrivateNetwork: true });
    expect(cfg.plugins?.allow).toEqual(["memory-core", "qa-channel"]);
    expect(cfg.plugins?.entries?.["memory-core"]).toEqual({ enabled: true });
    expect(cfg.plugins?.entries?.["qa-channel"]).toEqual({ enabled: true });
@@ -66,6 +71,31 @@ describe("buildQaGatewayConfig", () => {
    expect(cfg.messages?.groupChat?.mentionPatterns).toEqual(["\\b@?openclaw\\b"]);
  });

  it("maps provider-qualified openai and anthropic refs through the mock provider lane", () => {
    const cfg = buildQaGatewayConfig({
      bind: "loopback",
      gatewayPort: 18789,
      gatewayToken: "token",
      providerBaseUrl: "http://127.0.0.1:44080/v1",
      workspaceDir: "/tmp/qa-workspace",
      providerMode: "mock-openai",
      primaryModel: "openai/gpt-5.4",
      alternateModel: "anthropic/claude-opus-4-6",
    });

    expect(getPrimaryModel(cfg.agents?.defaults?.model)).toBe("openai/gpt-5.4");
    expect(cfg.models?.providers?.openai?.api).toBe("openai-responses");
    expect(cfg.models?.providers?.openai?.request).toEqual({ allowPrivateNetwork: true });
    expect(cfg.models?.providers?.openai?.models.map((model) => model.id)).toContain("gpt-5.4");
    expect(cfg.models?.providers?.anthropic?.api).toBe("anthropic-messages");
    expect(cfg.models?.providers?.anthropic?.baseUrl).toBe("http://127.0.0.1:44080");
    expect(cfg.models?.providers?.anthropic?.request).toEqual({ allowPrivateNetwork: true });
    expect(cfg.models?.providers?.anthropic?.models.map((model) => model.id)).toContain(
      "claude-opus-4-6",
    );
    expect(cfg.plugins?.allow).toEqual(["memory-core"]);
  });

  it("can omit qa-channel for live transport gateway children", () => {
    const cfg = buildQaGatewayConfig({
      bind: "loopback",

@@ -45,6 +45,10 @@ export function normalizeQaThinkingLevel(input: unknown): QaThinkingLevel | unde
  return undefined;
}

function trimTrailingApiV1(baseUrl: string) {
  return baseUrl.replace(/\/v1\/?$/i, "");
}
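The helper's regex in isolation: it strips a single trailing `/v1` or `/v1/` segment, case-insensitively, so one mock base URL can serve both the OpenAI-style `/v1` routes and the Anthropic route that expects no `/v1` suffix. A quick self-contained check of its behavior:

```typescript
// Same regex as trimTrailingApiV1 above, shown standalone.
function trimTrailingApiV1(baseUrl: string): string {
  return baseUrl.replace(/\/v1\/?$/i, "");
}

console.log(trimTrailingApiV1("http://127.0.0.1:44080/v1"));  // http://127.0.0.1:44080
console.log(trimTrailingApiV1("http://127.0.0.1:44080/V1/")); // http://127.0.0.1:44080
console.log(trimTrailingApiV1("http://127.0.0.1:44080"));     // http://127.0.0.1:44080
```

Note the anchor `$`: only a trailing segment is removed, so a path like `/v1/models` would be left intact.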

export function mergeQaControlUiAllowedOrigins(extraOrigins?: string[]) {
  const normalizedExtra = (extraOrigins ?? [])
    .map((origin) => origin.trim())
@@ -74,10 +78,14 @@ export function buildQaGatewayConfig(params: {
  thinkingDefault?: QaThinkingLevel;
}): OpenClawConfig {
  const mockProviderBaseUrl = params.providerBaseUrl ?? "http://127.0.0.1:44080/v1";
  const mockAnthropicBaseUrl = trimTrailingApiV1(mockProviderBaseUrl);
  const mockOpenAiProvider: ModelProviderConfig = {
    baseUrl: mockProviderBaseUrl,
    apiKey: "test",
    api: "openai-responses",
    request: {
      allowPrivateNetwork: true,
    },
    models: [
      {
        id: "gpt-5.4",
@@ -126,6 +134,50 @@ export function buildQaGatewayConfig(params: {
      },
    ],
  };
  const mockNamedOpenAiProvider: ModelProviderConfig = {
    ...mockOpenAiProvider,
    models: mockOpenAiProvider.models.map((model) => ({ ...model })),
  };
  const mockAnthropicProvider: ModelProviderConfig = {
    baseUrl: mockAnthropicBaseUrl,
    apiKey: "test",
    api: "anthropic-messages",
    request: {
      allowPrivateNetwork: true,
    },
    models: [
      {
        id: "claude-opus-4-6",
        name: "claude-opus-4-6",
        api: "anthropic-messages",
        reasoning: false,
        input: ["text", "image"],
        cost: {
          input: 0,
          output: 0,
          cacheRead: 0,
          cacheWrite: 0,
        },
        contextWindow: 200_000,
        maxTokens: 4096,
      },
      {
        id: "claude-sonnet-4-6",
        name: "claude-sonnet-4-6",
        api: "anthropic-messages",
        reasoning: false,
        input: ["text", "image"],
        cost: {
          input: 0,
          output: 0,
          cacheRead: 0,
          cacheWrite: 0,
        },
        contextWindow: 200_000,
        maxTokens: 4096,
      },
    ],
  };
  const providerMode = normalizeQaProviderMode(params.providerMode ?? "mock-openai");
  const primaryModel = params.primaryModel ?? defaultQaModelForMode(providerMode);
  const alternateModel =
@@ -273,6 +325,8 @@ export function buildQaGatewayConfig(params: {
      mode: "replace",
      providers: {
        "mock-openai": mockOpenAiProvider,
+       openai: mockNamedOpenAiProvider,
+       anthropic: mockAnthropicProvider,
      },
    },
  }

@@ -118,6 +118,50 @@ describe("qa scenario catalog", () => {
    );
  });

  it("keeps mock-only image debug assertions guarded in live-frontier runs", () => {
    const scenario = readQaScenarioPack().scenarios.find(
      (candidate) => candidate.id === "image-understanding-attachment",
    );
    const imageRequestAction = scenario?.execution.flow?.steps
      .flatMap((step) => step.actions ?? [])
      .find(
        (
          action,
        ): action is {
          set: string;
          value?: { expr?: string };
        } =>
          typeof action === "object" &&
          action !== null &&
          "set" in action &&
          action.set === "imageRequest",
      );
    const imageRequestExpr = imageRequestAction?.value?.expr;

    expect(imageRequestExpr).toContain("env.mock ?");
    expect(imageRequestExpr).toContain("/debug/requests");
  });

  it("adds a repo-instruction followthrough scenario to the parity pack", () => {
    const scenario = readQaScenarioById("instruction-followthrough-repo-contract");
    const config = readQaScenarioExecutionConfig("instruction-followthrough-repo-contract") as
      | {
          workspaceFiles?: Record<string, string>;
          prompt?: string;
          expectedReplyAll?: string[];
        }
      | undefined;

    expect(config?.workspaceFiles?.["AGENT.md"]).toContain("Step order:");
    expect(config?.workspaceFiles?.["SOUL.md"]).toContain("action-first");
    expect(config?.workspaceFiles?.["FOLLOWTHROUGH_INPUT.md"]).toContain(
      "Mission: prove you followed the repo contract.",
    );
    expect(config?.prompt).toContain("Repo contract followthrough check.");
    expect(config?.expectedReplyAll).toEqual(["read:", "wrote:", "status:"]);
    expect(scenario.title).toBe("Instruction followthrough repo contract");
  });

  it("rejects malformed string matcher lists before running a flow", () => {
    expect(() =>
      validateQaScenarioExecutionConfig({

101
extensions/qa-lab/src/suite.summary-json.test.ts
Normal file
@@ -0,0 +1,101 @@
import { describe, expect, it } from "vitest";
import { buildQaSuiteSummaryJson } from "./suite.js";

describe("buildQaSuiteSummaryJson", () => {
  const baseParams = {
    // Test scenarios include a `steps: []` field to match the real suite
    // scenario-result shape so downstream consumers that rely on the shape
    // (parity gate, report render) stay aligned.
    scenarios: [
      { name: "Scenario A", status: "pass" as const, steps: [] },
      { name: "Scenario B", status: "fail" as const, details: "something broke", steps: [] },
    ],
    startedAt: new Date("2026-04-11T00:00:00.000Z"),
    finishedAt: new Date("2026-04-11T00:05:00.000Z"),
    providerMode: "mock-openai" as const,
    primaryModel: "openai/gpt-5.4",
    alternateModel: "openai/gpt-5.4-alt",
    fastMode: true,
    concurrency: 2,
  };

  it("records provider/model/mode so parity gates can verify labels", () => {
    const json = buildQaSuiteSummaryJson(baseParams);
    expect(json.run).toMatchObject({
      startedAt: "2026-04-11T00:00:00.000Z",
      finishedAt: "2026-04-11T00:05:00.000Z",
      providerMode: "mock-openai",
      primaryModel: "openai/gpt-5.4",
      primaryProvider: "openai",
      primaryModelName: "gpt-5.4",
      alternateModel: "openai/gpt-5.4-alt",
      alternateProvider: "openai",
      alternateModelName: "gpt-5.4-alt",
      fastMode: true,
      concurrency: 2,
      scenarioIds: null,
    });
  });

  it("includes scenarioIds in run metadata when provided", () => {
    const scenarioIds = ["approval-turn-tool-followthrough", "subagent-handoff", "memory-recall"];
    const json = buildQaSuiteSummaryJson({
      ...baseParams,
      scenarioIds,
    });
    expect(json.run.scenarioIds).toEqual(scenarioIds);
  });

  it("treats an empty scenarioIds array as unspecified (no filter)", () => {
    // A CLI path that omits --scenario passes an empty array to runQaSuite.
    // The summary must encode that as null so downstream parity/report
    // tooling doesn't interpret a full run as an explicit empty selection.
    const json = buildQaSuiteSummaryJson({
      ...baseParams,
      scenarioIds: [],
    });
    expect(json.run.scenarioIds).toBeNull();
  });

  it("records an Anthropic baseline lane cleanly for parity runs", () => {
    const json = buildQaSuiteSummaryJson({
      ...baseParams,
      primaryModel: "anthropic/claude-opus-4-6",
      alternateModel: "anthropic/claude-sonnet-4-6",
    });
    expect(json.run).toMatchObject({
      primaryModel: "anthropic/claude-opus-4-6",
      primaryProvider: "anthropic",
      primaryModelName: "claude-opus-4-6",
      alternateModel: "anthropic/claude-sonnet-4-6",
      alternateProvider: "anthropic",
      alternateModelName: "claude-sonnet-4-6",
    });
  });

  it("leaves split fields null when a model ref is malformed", () => {
    const json = buildQaSuiteSummaryJson({
      ...baseParams,
      primaryModel: "not-a-real-ref",
      alternateModel: "",
    });
    expect(json.run).toMatchObject({
      primaryModel: "not-a-real-ref",
      primaryProvider: null,
      primaryModelName: null,
      alternateModel: "",
      alternateProvider: null,
      alternateModelName: null,
    });
  });

  it("keeps scenarios and counts alongside the run metadata", () => {
    const json = buildQaSuiteSummaryJson(baseParams);
    expect(json.scenarios).toHaveLength(2);
    expect(json.counts).toEqual({
      total: 2,
      passed: 1,
      failed: 1,
    });
  });
});
@@ -81,7 +81,7 @@ type QaSuiteStep = {
  run: () => Promise<string | void>;
};

-type QaSuiteScenarioResult = {
+export type QaSuiteScenarioResult = {
  name: string;
  status: "pass" | "fail";
  steps: QaReportCheck[];
@@ -1365,17 +1365,105 @@ function createQaSuiteReportNotes(params: {
  return params.transport.createReportNotes(params);
}

export type QaSuiteSummaryJsonParams = {
  scenarios: QaSuiteScenarioResult[];
  startedAt: Date;
  finishedAt: Date;
  providerMode: QaProviderMode;
  primaryModel: string;
  alternateModel: string;
  fastMode: boolean;
  concurrency: number;
  scenarioIds?: readonly string[];
};

/**
 * Strongly-typed shape of `qa-suite-summary.json`. The GPT-5.4 parity gate
 * (agentic-parity-report.ts, #64441) and any future parity wrapper can
 * import this type instead of re-declaring the shape, so changes to the
 * summary schema propagate through to every consumer at type-check time.
 */
export type QaSuiteSummaryJson = {
  scenarios: QaSuiteScenarioResult[];
  counts: {
    total: number;
    passed: number;
    failed: number;
  };
  run: {
    startedAt: string;
    finishedAt: string;
    providerMode: QaProviderMode;
    primaryModel: string;
    primaryProvider: string | null;
    primaryModelName: string | null;
    alternateModel: string;
    alternateProvider: string | null;
    alternateModelName: string | null;
    fastMode: boolean;
    concurrency: number;
    scenarioIds: string[] | null;
  };
};

/**
 * Pure-ish JSON builder for qa-suite-summary.json. Exported so the GPT-5.4
 * parity gate (agentic-parity-report.ts, #64441) and any future parity
 * runner can assert-and-trust the provider/model that produced a given
 * summary instead of blindly accepting the caller's candidateLabel /
 * baselineLabel. Without the `run` block, a maintainer who swaps candidate
 * and baseline summary paths could silently produce a mislabeled verdict.
 *
 * `scenarioIds` is only recorded when the caller passed a non-empty array
 * (an explicit scenario selection). A missing or empty array means "no
 * filter, full lane-selected catalog", which the summary encodes as `null`
 * so parity/report tooling doesn't mistake a full run for an explicit
 * empty selection.
 */
export function buildQaSuiteSummaryJson(params: QaSuiteSummaryJsonParams): QaSuiteSummaryJson {
  const primarySplit = splitModelRef(params.primaryModel);
  const alternateSplit = splitModelRef(params.alternateModel);
  return {
    scenarios: params.scenarios,
    counts: {
      total: params.scenarios.length,
      passed: params.scenarios.filter((scenario) => scenario.status === "pass").length,
      failed: params.scenarios.filter((scenario) => scenario.status === "fail").length,
    },
    run: {
      startedAt: params.startedAt.toISOString(),
      finishedAt: params.finishedAt.toISOString(),
      providerMode: params.providerMode,
      primaryModel: params.primaryModel,
      primaryProvider: primarySplit?.provider ?? null,
      primaryModelName: primarySplit?.model ?? null,
      alternateModel: params.alternateModel,
      alternateProvider: alternateSplit?.provider ?? null,
      alternateModelName: alternateSplit?.model ?? null,
      fastMode: params.fastMode,
      concurrency: params.concurrency,
      scenarioIds:
        params.scenarioIds && params.scenarioIds.length > 0 ? [...params.scenarioIds] : null,
    },
  };
}
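Two normalization rules do the heavy lifting in the builder: the provider/model split (malformed refs yield `null` fields) and the "empty selection means no filter" encoding of `scenarioIds`. `splitModelRef` itself is not part of this diff, so the version below is a hypothetical stand-in that only assumes the `provider/model` convention implied by the tests:

```typescript
// Stand-in for splitModelRef (not shown in this diff): split on the
// first "/", and treat a missing provider or model half as malformed.
function splitModelRefSketch(ref: string): { provider: string; model: string } | null {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) return null;
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}

// The scenarioIds rule from buildQaSuiteSummaryJson: undefined and []
// both mean "no filter", which the summary encodes as null.
function normalizeScenarioIds(ids?: readonly string[]): string[] | null {
  return ids && ids.length > 0 ? [...ids] : null;
}

console.log(splitModelRefSketch("anthropic/claude-opus-4-6")); // { provider: 'anthropic', model: 'claude-opus-4-6' }
console.log(splitModelRefSketch("not-a-real-ref"));            // null
console.log(normalizeScenarioIds([]));                         // null
```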

async function writeQaSuiteArtifacts(params: {
  outputDir: string;
  startedAt: Date;
  finishedAt: Date;
  scenarios: QaSuiteScenarioResult[];
  transport: QaTransportAdapter;
- providerMode: "mock-openai" | "live-frontier";
+ // Reuse the canonical QaProviderMode union instead of re-declaring it
+ // inline. Loop 6 already unified `QaSuiteSummaryJsonParams.providerMode`
+ // on this type; keeping the writer in sync prevents drift when model-
+ // selection.ts adds a new provider mode.
+ providerMode: QaProviderMode;
  primaryModel: string;
  alternateModel: string;
  fastMode: boolean;
  concurrency: number;
  scenarioIds?: readonly string[];
}) {
  const report = renderQaMarkdownReport({
    title: "OpenClaw QA Scenario Suite",
@@ -1395,18 +1483,7 @@ async function writeQaSuiteArtifacts(params: {
  await fs.writeFile(reportPath, report, "utf8");
  await fs.writeFile(
    summaryPath,
-   `${JSON.stringify(
-     {
-       scenarios: params.scenarios,
-       counts: {
-         total: params.scenarios.length,
-         passed: params.scenarios.filter((scenario) => scenario.status === "pass").length,
-         failed: params.scenarios.filter((scenario) => scenario.status === "fail").length,
-       },
-     },
-     null,
-     2,
-   )}\n`,
+   `${JSON.stringify(buildQaSuiteSummaryJson(params), null, 2)}\n`,
    "utf8",
  );
  return { report, reportPath, summaryPath };
@@ -1576,6 +1653,16 @@ export async function runQaSuite(params?: QaSuiteRunParams): Promise<QaSuiteResu
    alternateModel,
    fastMode,
    concurrency,
    // When the caller supplied an explicit non-empty --scenario filter,
    // record the executed (post-selectQaSuiteScenarios-normalized) ids
    // so the summary matches what actually ran. When the caller passed
    // nothing or an empty array ("no filter, full lane catalog"),
    // preserve the unfiltered = null semantic so the summary stays
    // distinguishable from an explicit all-scenarios selection.
    scenarioIds:
      params?.scenarioIds && params.scenarioIds.length > 0
        ? selectedCatalogScenarios.map((scenario) => scenario.id)
        : undefined,
  });
  lab.setLatestReport({
    outputPath: reportPath,
@@ -1737,6 +1824,12 @@ export async function runQaSuite(params?: QaSuiteRunParams): Promise<QaSuiteResu
    alternateModel,
    fastMode,
    concurrency,
    // Same "filtered → executed list, unfiltered → null" convention as
    // the concurrent-path writeQaSuiteArtifacts call above.
    scenarioIds:
      params?.scenarioIds && params.scenarioIds.length > 0
        ? selectedCatalogScenarios.map((scenario) => scenario.id)
        : undefined,
  });
  const latestReport = {
    outputPath: reportPath,

@@ -151,6 +151,20 @@ steps:
        ref: imageStartedAtMs
      timeoutMs:
        expr: liveTurnTimeoutMs(env, 45000)
    # Tool-call assertion (criterion 2 of the parity completion
    # gate in #64227): the restored `image_generate` capability
    # must have actually fired as a real tool call. Without this
    # assertion, a prose reply that just mentions a MEDIA path
    # could satisfy the scenario, so strengthen it by requiring
    # the mock to have recorded `plannedToolName: "image_generate"`
    # against a post-restart request. The `!env.mock || ...`
    # guard means this check only runs in mock mode (where
    # `/debug/requests` is available); live-frontier runs skip
    # it and still pass the rest of the scenario.
    - assert:
        expr: "!env.mock || [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].some((request) => String(request.allInputText ?? '').toLowerCase().includes('capability flip image check') && request.plannedToolName === 'image_generate')"
        message:
          expr: "`expected image_generate tool call during capability flip scenario, saw plannedToolNames=${JSON.stringify([...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => String(request.allInputText ?? '').toLowerCase().includes('capability flip image check')).map((request) => request.plannedToolName ?? null))}`"
  finally:
    - call: patchConfig
      args:

@@ -64,9 +64,26 @@ steps:
        expr: "!missingColorGroup"
      message:
        expr: "`missing expected colors in image description: ${outbound.text}`"
    # Image-processing assertion: verify the mock actually received an
    # image on the scenario-unique prompt. This is as strong as a
    # tool-call assertion for this scenario — unlike the
    # `source-docs-discovery-report` / `subagent-handoff` /
    # `config-restart-capability-flip` scenarios that rely on a real
    # tool call to satisfy the parity criterion, image understanding
    # is handled inside the provider's vision capability and does NOT
    # emit a tool call the mock can record as `plannedToolName`. The
    # `imageInputCount` field IS the tool-call evidence for vision
    # scenarios: it proves the attachment reached the provider, which
    # is the only thing an external harness can verify in mock mode.
    # Match on the scenario-unique prompt substring so the assertion
    # can't be accidentally satisfied by some other scenario's image
    # request that happens to share a debug log with this one.
    - set: imageRequest
      value:
        expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].find((request) => String(request.prompt ?? '').includes('Image understanding check')) : null"
    - assert:
-       expr: "!env.mock || (((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.prompt ?? '').includes('Image understanding check'))?.imageInputCount ?? 0) >= 1)"
+       expr: "!env.mock || (imageRequest && (imageRequest.imageInputCount ?? 0) >= 1)"
        message:
-         expr: "`expected at least one input image, got ${String((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.prompt ?? '').includes('Image understanding check'))?.imageInputCount ?? 0)}`"
+         expr: "`expected at least one input image on the Image understanding check request, got imageInputCount=${String(imageRequest?.imageInputCount ?? 0)}`"
      detailsExpr: outbound.text
```

127
qa/scenarios/instruction-followthrough-repo-contract.md
Normal file
127
qa/scenarios/instruction-followthrough-repo-contract.md
Normal file
@@ -0,0 +1,127 @@
# Instruction followthrough repo contract

```yaml qa-scenario
id: instruction-followthrough-repo-contract
title: Instruction followthrough repo contract
surface: repo-contract
objective: Verify the agent reads repo instruction files first, follows the required tool order, and completes the first feasible action instead of stopping at a plan.
successCriteria:
- Agent reads the seeded instruction files before writing the requested artifact.
- Agent writes the requested artifact in the same run instead of returning only a plan.
- Agent does not ask for permission before the first feasible action.
- Final reply makes the completed read/write sequence explicit.
docsRefs:
- docs/help/testing.md
- docs/channels/qa-channel.md
codeRefs:
- src/agents/system-prompt.ts
- src/agents/pi-embedded-runner/run/incomplete-turn.ts
- extensions/qa-lab/src/mock-openai-server.ts
execution:
kind: flow
summary: Verify the agent reads repo instructions first, then completes the first bounded followthrough task without stalling.
config:
workspaceFiles:
AGENT.md: |-
# Repo contract

Step order:
1. Read AGENT.md.
2. Read SOUL.md.
3. Read FOLLOWTHROUGH_INPUT.md.
4. Write ./repo-contract-summary.txt.
5. Reply with three labeled lines exactly once: Read, Wrote, Status.

Do not stop after planning.
Do not ask for permission before the first feasible action.
SOUL.md: |-
# Execution style

Stay brief, honest, and action-first.
If the next tool action is feasible, do it before replying.
FOLLOWTHROUGH_INPUT.md: |-
Mission: prove you followed the repo contract.
Evidence path: AGENT.md -> SOUL.md -> FOLLOWTHROUGH_INPUT.md -> repo-contract-summary.txt
prompt: |-
Repo contract followthrough check. Read AGENT.md, SOUL.md, and FOLLOWTHROUGH_INPUT.md first.
Then follow the repo contract exactly, write ./repo-contract-summary.txt, and reply with
three labeled lines: Read, Wrote, Status.
Do not stop after planning and do not ask for permission before the first feasible action.
expectedReplyAll:
- "read:"
- "wrote:"
- "status:"
forbiddenNeedles:
- need permission
- need your approval
- can you approve
- i would
- i can
- next i would
```

```yaml qa-flow
steps:
- name: follows repo instructions instead of stopping at a plan
actions:
- call: reset
- forEach:
items:
expr: "Object.entries(config.workspaceFiles ?? {})"
item: workspaceFile
actions:
- call: fs.writeFile
args:
- expr: "path.join(env.gateway.workspaceDir, String(workspaceFile[0]))"
- expr: "`${String(workspaceFile[1] ?? '').trimEnd()}\\n`"
- utf8
- set: artifactPath
value:
expr: "path.join(env.gateway.workspaceDir, 'repo-contract-summary.txt')"
- call: runAgentPrompt
args:
- ref: env
- sessionKey: agent:qa:repo-contract
message:
expr: config.prompt
timeoutMs:
expr: liveTurnTimeoutMs(env, 40000)
- call: waitForCondition
saveAs: artifact
args:
- lambda:
async: true
expr: "((await fs.readFile(artifactPath, 'utf8').catch(() => null))?.includes('Mission: prove you followed the repo contract.') ? await fs.readFile(artifactPath, 'utf8').catch(() => null) : undefined)"
- expr: liveTurnTimeoutMs(env, 30000)
- expr: "env.providerMode === 'mock-openai' ? 100 : 250"
- set: expectedReplyAll
value:
expr: config.expectedReplyAll.map(normalizeLowercaseStringOrEmpty)
- call: waitForCondition
saveAs: outbound
args:
- lambda:
expr: "state.getSnapshot().messages.filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && expectedReplyAll.every((needle) => normalizeLowercaseStringOrEmpty(candidate.text).includes(needle))).at(-1)"
- expr: liveTurnTimeoutMs(env, 30000)
- expr: "env.providerMode === 'mock-openai' ? 100 : 250"
- assert:
expr: "!config.forbiddenNeedles.some((needle) => normalizeLowercaseStringOrEmpty(outbound.text).includes(needle))"
message:
expr: "`repo contract followthrough bounced for permission or stalled: ${outbound.text}`"
- set: followthroughDebugRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => /repo contract followthrough check/i.test(String(request.allInputText ?? ''))) : []"
- assert:
expr: "!env.mock || followthroughDebugRequests.filter((request) => request.plannedToolName === 'read').length >= 3"
message:
expr: "`expected three read tool calls before write, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
- assert:
expr: "!env.mock || followthroughDebugRequests.some((request) => request.plannedToolName === 'write')"
message:
expr: "`expected write tool call during repo contract followthrough, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
- assert:
expr: "!env.mock || (() => { const readIndices = followthroughDebugRequests.map((r, i) => r.plannedToolName === 'read' ? i : -1).filter(i => i >= 0); const firstWrite = followthroughDebugRequests.findIndex((r) => r.plannedToolName === 'write'); return readIndices.length >= 3 && firstWrite >= 0 && readIndices[2] < firstWrite; })()"
message:
expr: "`expected all 3 reads before any write during repo contract followthrough, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
detailsExpr: outbound.text
```
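The ordering IIFE in the final assert can be read as a standalone helper. A minimal sketch, under the same assumption the flow makes that each debug-log entry carries a `plannedToolName` field:

```javascript
// True only when at least `minReads` 'read' requests all precede the
// first 'write' request in the log; false when the write is missing or
// when any of the first `minReads` reads lands after it.
function readsPrecedeWrite(requests, minReads = 3) {
  const readIndices = requests
    .map((request, index) => (request.plannedToolName === "read" ? index : -1))
    .filter((index) => index >= 0);
  const firstWrite = requests.findIndex(
    (request) => request.plannedToolName === "write"
  );
  return (
    readIndices.length >= minReads &&
    firstWrite >= 0 &&
    readIndices[minReads - 1] < firstWrite
  );
}
```

Comparing the index of the `minReads`-th read against the first write is what makes a read-read-write-read log fail: the write may not interleave before the required reads are done.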
@@ -1,5 +1,36 @@
# Memory recall after context switch

<!--
This scenario deliberately stays prose-only and does NOT gate on a
`/debug/requests` tool-call assertion, even though it is one of the
scenarios in the parity pack. The adversarial review in the umbrella
#64227 thread called this out as a coverage gap, but the underlying
behavior the scenario tests is legitimately prose-shaped: the agent is
supposed to pull a prior-turn fact ("ALPHA-7") back across an
intervening context switch and reply with the code. In a real
conversation, the model can do this EITHER by calling a memory-search
tool (which the qa-lab mock server doesn't currently expose) OR by
reading the fact directly from prior-turn context in its own
conversation window. Both strategies are valid parity behavior.

Forcing a `plannedToolName` assertion here would either require
extending the mock with a synthetic `memory_search` tool lane (PR O
scope, not PR J) or fabricating a tool-call requirement the real
providers never implement. Either path would make this scenario test
the harness, not the models. So we keep it prose-only, covered by the
`recallExpectedAny` / `rememberAckAny` assertions above, and flag the
exception explicitly rather than silently.

Criterion 2 of the parity completion gate (no fake progress or fake
tool completion) is enforced for this scenario through the parity
report's failure-tone fake-success detector: a scenario marked `pass`
whose details text matches patterns like "timed out", "failed to",
"could not" gets flagged via `SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS`
in `extensions/qa-lab/src/agentic-parity-report.ts`. Positive-tone
detection was removed because it false-positives on legitimate passes
where the details field is the model's outbound prose.
-->
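The failure-tone detector described above can be sketched as a predicate. The exact pattern list here is an assumption for illustration (the real one lives in `extensions/qa-lab/src/agentic-parity-report.ts`), built from the three example phrases the comment names:

```javascript
// Assumed pattern list: the real SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS
// may contain more entries than the three phrases named above.
const SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS = [
  /timed out/i,
  /failed to/i,
  /could not/i,
];

// A scenario is a suspicious pass when it is marked 'pass' but its
// details text reads like a failure narration. Non-pass statuses are
// never flagged; only fake successes are of interest here.
function isSuspiciousPass(status, detailsText) {
  return (
    status === "pass" &&
    SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS.some((pattern) =>
      pattern.test(String(detailsText ?? ""))
    )
  );
}
```

Note the asymmetry the comment justifies: only failure-tone text on a `pass` is flagged; positive-tone text is left alone because the details field is often the model's own outbound prose.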

```yaml qa-scenario
id: memory-recall
title: Memory recall after context switch

@@ -69,13 +69,22 @@ steps:
expr: hasModelSwitchContinuityEvidence(outbound.text)
message:
expr: "`switch reply missed kickoff continuity: ${outbound.text}`"
- if:
expr: "Boolean(env.mock)"
then:
- set: switchDebugRequests
value:
expr: "await fetchJson(`${env.mock.baseUrl}/debug/requests`)"
- set: switchRequest
value:
expr: "switchDebugRequests.find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))"
- assert:
expr: "switchRequest?.plannedToolName === 'read'"
message:
expr: "`expected read after switch, got ${String(switchRequest?.plannedToolName ?? '')}`"
- assert:
expr: "String(switchRequest?.model ?? '') === String(alternate?.model ?? '')"
message:
expr: "`expected alternate model, got ${String(switchRequest?.model ?? '')}`"
detailsExpr: outbound.text
```

@@ -56,5 +56,20 @@ steps:
expr: "!reportsDiscoveryScopeLeak(outbound.text)"
message:
expr: "`discovery report drifted beyond scope: ${outbound.text}`"
# Parity gate criterion 2 (no fake progress / fake tool completion):
# require an actual read tool call before the prose report. Without this,
# a model could fabricate a plausible Worked/Failed/Blocked/Follow-up
# report without ever touching the repo files the prompt names. The
# debug request log is fetched once and reused for both the assertion
# and its failure-message diagnostic. Each request's allInputText is
# lowercased inline at match time (the real prompt writes it as
# "Worked, Failed, Blocked") so the contains check is case-insensitive.
- set: discoveryDebugRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))] : []"
- assert:
expr: "!env.mock || discoveryDebugRequests.some((request) => String(request.allInputText ?? '').toLowerCase().includes('worked, failed, blocked') && request.plannedToolName === 'read')"
message:
expr: "`expected at least one read tool call during discovery report scenario, saw plannedToolNames=${JSON.stringify(discoveryDebugRequests.map((request) => request.plannedToolName ?? null))}`"
detailsExpr: outbound.text
```
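The case-insensitive contains check above reduces to a small predicate. A minimal sketch, assuming log entries shaped like `{ allInputText, plannedToolName }` per the expressions above; the sample entries are invented:

```javascript
// True when some request both contains the scenario's prompt needle
// (compared case-insensitively) and actually planned a 'read' tool call.
// A prose-only fabricated report leaves plannedToolName unset on every
// matching request, so it cannot satisfy this predicate.
function hasRealReadCall(requests, promptNeedle) {
  const needle = promptNeedle.toLowerCase();
  return requests.some(
    (request) =>
      String(request.allInputText ?? "").toLowerCase().includes(needle) &&
      request.plannedToolName === "read"
  );
}
```

Lowercasing both sides at match time is what lets the YAML assertion use the lowercase needle `'worked, failed, blocked'` against the prompt's mixed-case "Worked, Failed, Blocked".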

@@ -113,6 +113,28 @@ steps:
expr: "sawAlpha && sawBeta"
message:
expr: "`fanout child sessions missing (alpha=${String(sawAlpha)} beta=${String(sawBeta)})`"
# Tool-call assertion (criterion 2 of the parity completion gate in
# #64227): the scenario must have actually invoked `sessions_spawn` at
# least twice, not just ended up with two rows in the session store
# through prose trickery. The session store alone can be populated by
# other flows or by a model that fabricates "delegation" narration.
# `plannedToolName` on the mock's `/debug/requests` log is the
# tool-call ground truth: two recorded sessions_spawn requests mean
# the model really dispatched both subagents, and the sawAlpha/sawBeta
# check above confirms the two children carried distinct labels.
- set: fanoutSpawnRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => request.plannedToolName === 'sessions_spawn' && /subagent fanout synthesis check/i.test(String(request.allInputText ?? ''))) : []"
- assert:
expr: "!env.mock || fanoutSpawnRequests.length >= 2"
message:
expr: "`expected at least two sessions_spawn tool calls during subagent fanout scenario, saw ${fanoutSpawnRequests.length}`"
- set: details
value:
expr: "outbound.text"

@@ -46,5 +46,25 @@ steps:
expr: "!['failed to delegate','could not delegate','subagent unavailable'].some((needle) => normalizeLowercaseStringOrEmpty(outbound.text).includes(needle))"
message:
expr: "`subagent handoff reported failure: ${outbound.text}`"
# Parity gate criterion 2 (no fake progress / fake tool completion):
# require an actual sessions_spawn tool call. Without this, a model
# could produce the three labeled sections ("Delegated task", "Result",
# "Evidence") as free-form prose without ever delegating to a real
# subagent. The assertion is pinned to THIS scenario by matching the
# scenario-unique prompt substring "Delegate one bounded QA task"
# (not a broad /delegate|subagent/ regex) so the earlier
# subagent-fanout-synthesis scenario — which also contains "delegate"
# and produces its own pre-tool sessions_spawn request — cannot
# satisfy the assertion here. The match is also constrained to
# pre-tool requests (no toolOutput) because the mock only plans
# sessions_spawn on requests with no toolOutput; the follow-up
# request after the tool runs has plannedToolName unset.
- set: subagentDebugRequests
value:
expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))] : []"
- assert:
expr: "!env.mock || subagentDebugRequests.some((request) => !request.toolOutput && /delegate one bounded qa task/i.test(String(request.allInputText ?? '')) && request.plannedToolName === 'sessions_spawn')"
message:
expr: "`expected sessions_spawn tool call during subagent handoff scenario, saw plannedToolNames=${JSON.stringify(subagentDebugRequests.map((request) => request.plannedToolName ?? null))}`"
detailsExpr: outbound.text
```
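The pre-tool constraint the comment describes can be sketched as a predicate. A minimal sketch, assuming log entries shaped like `{ allInputText, toolOutput, plannedToolName }` per the expressions above; the sample entries are invented:

```javascript
// True when some pre-tool request (no toolOutput yet) carries this
// scenario's unique prompt substring AND actually planned sessions_spawn.
// The follow-up request after the tool runs carries toolOutput, so the
// !request.toolOutput guard excludes it even if it echoed the prompt.
function sawRealSpawn(requests) {
  return requests.some(
    (request) =>
      !request.toolOutput &&
      /delegate one bounded qa task/i.test(String(request.allInputText ?? "")) &&
      request.plannedToolName === "sessions_spawn"
  );
}
```

Pinning on the scenario-unique "Delegate one bounded QA task" substring, rather than a broad delegate/subagent regex, is what keeps the fanout scenario's spawn requests from leaking into this check.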

@@ -1 +1 @@
b92daceecab88cdb1ceeab30a7321399850a1fd13773af22dbb2035d39cdd5f8
1d087c0991987824d78c8ac4ec2c0e66d661f4bd4afd12b193d66634c69d75a0