* Feat: LM Studio Integration
* Format
* Support usage reporting when streaming is true
Fix token count
* Add custom window check
* Drop max tokens fallback
* Tweak docs
Update generated
* Avoid error if stale header does not resolve
* Fix test
* Fix test
* Fix rebase issues
Trim code
* Fix tests
Drop keyless
Fixes
* Fix linter issues in tests
* Update generated artifacts
* Do not make header resolution fatal for discovery
* Do the same for the API key
* fix: honor lmstudio preload runtime auth
* fix: clear stale lmstudio header auth
* fix: lazy-load lmstudio runtime facade
* fix: preserve lmstudio shared synthetic auth
* fix: clear stale lmstudio header auth in discovery
* fix: prefer lmstudio header auth for discovery
* fix: honor lmstudio header auth in warmup paths
* fix: clear stale lmstudio profile auth
* fix: ignore lmstudio env auth on header migration
* fix: use local lmstudio setup seam
* fix: resolve lmstudio rebase fallout
---------
Co-authored-by: Frank Yang <frank.ekn@gmail.com>
* test(qa): gate parity prose scenarios on real tool calls
Closes criterion 2 of the GPT-5.4 parity completion gate in #64227 ('no
fake progress / fake tool completion') for the two first/second-wave
parity scenarios that can currently pass with a prose-only reply.
Background: the scenario framework already exposes tool-call assertions
via /debug/requests on the mock server (see approval-turn-tool-followthrough
for the pattern). Most parity scenarios use this seam to require a specific
plannedToolName, but source-docs-discovery-report and subagent-handoff
only checked the assistant's prose text, which means a model could fabricate:
- a Worked / Failed / Blocked / Follow-up report without ever calling
the read tool on the docs / source files the prompt named
- three labeled 'Delegated task', 'Result', 'Evidence' sections without
ever calling sessions_spawn to delegate
Both gaps are fake-progress loopholes for the parity gate.
Changes:
- source-docs-discovery-report: require at least one read tool call tied
to the 'worked, failed, blocked' prompt in /debug/requests. Failure
message dumps the observed plannedToolName list for debugging.
- subagent-handoff: require at least one sessions_spawn tool call tied
to the 'delegate' / 'subagent handoff' prompt in /debug/requests. Same
debug-friendly failure message.
Both assertions are gated behind !env.mock so they no-op in live-frontier
mode where the real provider exposes plannedToolName through a different
channel (or not at all).
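In sketch form, the mock-gated assertion pattern looks roughly like this (env.mock, the DebugRequest shape, and the helper name are assumptions here, not the real scenario-framework API):

```typescript
// Hypothetical shape of a /debug/requests snapshot entry.
interface DebugRequest {
  plannedToolName?: string;
}

// Mock-only tool-call gate: no-ops outside mock mode, otherwise requires
// at least one request that planned the named tool.
function passesToolCallGate(
  mockMode: boolean,
  requests: DebugRequest[],
  toolName: string,
): boolean {
  if (!mockMode) {
    return true; // live-frontier mode: assertion no-ops, per the gating note
  }
  return requests.some((request) => request.plannedToolName === toolName);
}
```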
Not touched: memory-recall is also in the parity pack but its pass path
is legitimately 'read the fact from prior-turn context'. That is a valid
recall strategy, not fake progress, so it is out of scope for this PR.
memory-recall's fake-progress story (no real memory_search call) would
require bigger mock-server changes and belongs in a follow-up that
extends the mock memory pipeline.
Validation:
- pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
Refs #64227
* test(qa): fix case-sensitive tool-call assertions and dedupe debug fetch
Addresses loop-6 review feedback on PR #64681:
1. Copilot / Greptile / codex-connector all flagged that the discovery
scenario's .includes('worked, failed, blocked') assertion is
case-sensitive but the real prompt says 'Worked, Failed, Blocked...',
so the mock-mode assertion never matches. Fix: lowercase-normalize
allInputText before the contains check.
2. Greptile P2: the expr and message.expr each called fetchJson
separately, incurring two round-trips to /debug/requests. Fix: hoist
the fetch to a set step (discoveryDebugRequests / subagentDebugRequests)
and reuse the snapshot.
3. Copilot: the subagent-handoff assertion scanned the entire request
log and matched the first request with 'delegate' in its input text,
which could false-pass on a stale prior scenario. Fix: reverse the
array and take the most recent matching request instead.
Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).
Refs #64227
* test(qa): narrow subagent-handoff tool-call assertion to pre-tool requests
Pass-2 codex-connector P1 finding on #64681: the reverse-find pattern I
used on pass 1 usually lands on the FOLLOW-UP request after the mock
runs sessions_spawn, not the pre-tool planning request that actually
has plannedToolName === 'sessions_spawn'. The mock only plans that tool
on requests with !toolOutput (mock-openai-server.ts:662), so the
post-tool request has plannedToolName unset and the assertion fails
even when the handoff succeeded.
Fix: switch the assertion back to a forward .some() match but add a
!request.toolOutput filter so the match is pinned to the pre-tool
planning phase. The case-insensitive regex, the fetchJson dedupe, and
the failure-message diagnostic from pass 1 are unchanged.
Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).
Refs #64227
* test(qa): pin subagent-handoff tool-call assertion to scenario prompt
Addresses the pass-3 codex-connector P1 on #64681: the pass-2 fix
filtered to pre-tool requests but still used a broad
`/delegate|subagent handoff/i` regex. The `subagent-fanout-synthesis`
scenario runs BEFORE `subagent-handoff` in catalog order (scenarios
are sorted by path), and the fanout prompt reads
'Subagent fanout synthesis check: delegate exactly two bounded
subagents sequentially' — which contains 'delegate' and also plans
sessions_spawn pre-tool. That produces a cross-scenario false pass
where the fanout's earlier sessions_spawn request satisfies the
handoff assertion even when the handoff run never delegates.
Fix: tighten the input-text match from `/delegate|subagent handoff/i`
to `/delegate one bounded qa task/i`, which is the exact scenario-
unique substring from the `subagent-handoff` config.prompt. That
pins the assertion to this scenario's request window and closes the
cross-scenario false positive.
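The resulting assertion shape is roughly as follows (the DebugRequest fields and helper name are assumptions; the regex is the scenario-unique substring quoted above):

```typescript
interface DebugRequest {
  plannedToolName?: string;
  toolOutput?: string;
  allInputText: string;
}

// Scenario-unique substring from the subagent-handoff config.prompt.
const HANDOFF_PROMPT = /delegate one bounded qa task/i;

function matchesHandoffPlanningRequest(requests: DebugRequest[]): boolean {
  return requests.some(
    (request) =>
      !request.toolOutput && // pin to the pre-tool planning phase
      HANDOFF_PROMPT.test(request.allInputText) &&
      request.plannedToolName === "sessions_spawn",
  );
}
```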
Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).
Refs #64227
* test(qa): align parity assertion comments with actual filter logic
Addresses two loop-7 Copilot findings on PR #64681:
1. source-docs-discovery-report.md: the explanatory comment said the
debug request log was 'lowercased for case-insensitive matching',
but the code actually lowercases each request's allInputText inline
inside the .some() predicate, not the discoveryDebugRequests
snapshot. Rewrite the comment to describe the inline-lowercase
pattern so a future reader matches the code they see.
2. subagent-handoff.md: the comment said the assertion 'must be
pinned to THIS scenario's request window' but the implementation
actually relies on matching a scenario-unique prompt substring
(/delegate one bounded qa task/i), not a request-window. Rewrite
the comment to describe the substring pinning and keep the
pre-tool filter rationale intact.
No runtime change; comment-only fix to keep reviewer expectations
aligned with the actual assertion shape.
Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).
Refs #64227
* test(qa): extend tool-call assertions to image-understanding, subagent-fanout, and capability-flip scenarios
* Guard mock-only image parity assertions
* Expand agentic parity second wave
* test(qa): pad parity suspicious-pass isolation to second wave
* qa-lab: parametrize parity report title and drop stale first-wave comment
Addresses two loop-7 Copilot findings on PR #64662:
1. Hard-coded 'GPT-5.4 / Opus 4.6' markdown H1: the renderer now uses a
template string that interpolates candidateLabel and baselineLabel, so
any parity run (not only gpt-5.4 vs opus 4.6) renders an accurate
title in saved reports. Default CLI flags still produce
openai/gpt-5.4 vs anthropic/claude-opus-4-6 as the baseline pair.
2. Stale 'declared first-wave parity scenarios' comment in
scopeSummaryToParityPack: the parity pack is now the ten-scenario
first-wave+second-wave set (PR D + PR E). Comment updated to drop
the first-wave qualifier and name the full QA_AGENTIC_PARITY_SCENARIOS
constant the scope is filtering against.
New regression: 'parametrizes the markdown header from the comparison
labels' — asserts that non-default labels (openai/gpt-5.4-alt vs
openai/gpt-5.4) render in the H1.
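The parametrized H1 is, in sketch form (the exact title wording and the function name are assumptions; only candidateLabel / baselineLabel come from the change above):

```typescript
// Interpolate the comparison labels instead of hard-coding model names.
function renderMarkdownTitle(candidateLabel: string, baselineLabel: string): string {
  return `# ${candidateLabel} vs ${baselineLabel} agentic parity report`;
}
```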
Validation: pnpm test extensions/qa-lab/src/agentic-parity-report.test.ts
(13/13 pass).
Refs #64227
* qa-lab: fail parity gate on required scenario failures regardless of baseline parity
* test(qa): update readable-report test to cover all 10 parity scenarios
* qa-lab: strengthen parity-report fake-success detector and verify run.primaryProvider labels
* Tighten parity label and scenario checks
* fix: tighten parity label provenance checks
* fix: scope parity tool-call metrics to tool lanes
* Fix parity report label and fake-success checks
* fix(qa): tighten parity report edge cases
* qa-lab: add Anthropic /v1/messages mock route for parity baseline
Closes the last local-runnability gap on criterion 5 of the GPT-5.4 parity
completion gate in #64227 ('the parity gate shows GPT-5.4 matches or beats
Opus 4.6 on the agreed metrics').
Background: the parity gate needs two comparable scenario runs - one
against openai/gpt-5.4 and one against anthropic/claude-opus-4-6 - so the
aggregate metrics and verdict in PR D (#64441) can be computed. Today the
qa-lab mock server only implements /v1/responses, so the baseline run
against Claude Opus 4.6 requires a real Anthropic API key. That makes the
gate impossible to prove end-to-end from a local worktree and means the
CI story is always 'two real providers + quota + keys'.
This PR adds a /v1/messages Anthropic-compatible route to the existing
mock OpenAI server. The route is a thin adapter that:
- Parses Anthropic Messages API request shapes (system as string or
[{type:text,text}], messages with string or block content, text and
tool_result and tool_use and image blocks)
- Translates them into the ResponsesInputItem[] shape the existing shared
scenario dispatcher (buildResponsesPayload) already understands
- Calls the shared dispatcher so both the OpenAI and Anthropic lanes run
through the exact same scenario prompt-matching logic (same subagent
fanout state machine, same extractRememberedFact helper, same
'/debug/requests' telemetry)
- Converts the resulting OpenAI-format events back into an Anthropic
message response with text and tool_use content blocks and a correct
stop_reason (tool_use vs end_turn)
Non-streaming only: the QA suite runner falls back to non-streaming mock
mode so real Anthropic SSE isn't necessary for the parity baseline.
Also adds claude-opus-4-6 and claude-sonnet-4-6 to /v1/models so baseline
model-list probes from the suite runner resolve without extra config.
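One adapter detail from the list above, in sketch form: Anthropic `system` may arrive as a plain string or as [{type:"text",text}] blocks, so the adapter normalizes it to one string (the helper name is an assumption):

```typescript
type AnthropicSystem = string | Array<{ type: "text"; text: string }>;

// Collapse either accepted system shape into a single prompt string.
function normalizeSystem(system: AnthropicSystem | undefined): string {
  if (!system) return "";
  if (typeof system === "string") return system;
  return system.map((block) => block.text).join("\n");
}
```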
Tests added:
- advertises Anthropic claude-opus-4-6 baseline model on /v1/models
- dispatches an Anthropic /v1/messages read tool call for source discovery
prompts (tool_use stop_reason, correct input path, /debug/requests
records plannedToolName=read)
- dispatches Anthropic /v1/messages tool_result follow-ups through the
shared scenario logic (subagent-handoff two-stage flow: tool_use -
tool_result - 'Delegated task / Evidence' prose summary)
Local validation:
- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (18/18 pass)
- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (47/47 pass)
Refs #64227
Unblocks #64441 (parity harness) and the forthcoming qa parity run wrapper
by giving the baseline lane a local-only mock path.
* qa-lab: fix Anthropic tool_result ordering in messages adapter
Addresses the loop-6 Copilot / Greptile finding on PR #64685: in
`convertAnthropicMessagesToResponsesInput`, `tool_result` blocks were
pushed to `items` inside the per-block loop while the surrounding
user/assistant message was only pushed after the loop finished. That
reordered the function_call_output BEFORE its parent user message
whenever a user turn mixed `tool_result` with fresh text/image blocks,
which broke `extractToolOutput` (it scans AFTER the last user-role
index; function_call_output placed BEFORE that index is invisible to it)
and made the downstream scenario dispatcher behave as if no tool output
had been returned on mixed-content turns.
Fix: buffer `tool_result` and `tool_use` blocks in local arrays during
the per-block loop, push the parent role message first (when it has any
text/image pieces), then push the accumulated function_call /
function_call_output items in original order. tool_result-only user
turns still omit the parent message as before, so the non-mixed
subagent-fanout-synthesis two-stage flow that already worked keeps
working.
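The buffered-push fix looks roughly like this (block and item shapes are simplified assumptions):

```typescript
type Block =
  | { type: "text"; text: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

type Item =
  | { role: "user"; content: string }
  | { type: "function_call_output"; call_id: string; output: string };

function convertUserTurn(blocks: Block[]): Item[] {
  const items: Item[] = [];
  const toolItems: Item[] = [];
  const textPieces: string[] = [];
  for (const block of blocks) {
    if (block.type === "text") {
      textPieces.push(block.text);
    } else {
      // Buffer tool_result instead of pushing it ahead of the parent message.
      toolItems.push({
        type: "function_call_output",
        call_id: block.tool_use_id,
        output: block.content,
      });
    }
  }
  if (textPieces.length > 0) {
    items.push({ role: "user", content: textPieces.join("\n") }); // parent first
  }
  items.push(...toolItems); // then tool output, in original order
  return items;
}
```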
Regression added:
- `places tool_result after the parent user message even in mixed-content
turns` — sends a user turn that mixes a `tool_result` block with a
trailing fresh text block, then inspects `/debug/last-request` to
assert that `toolOutput === 'SUBAGENT-OK'` (extractToolOutput found
the function_call_output AFTER the last user index) and
`prompt === 'Keep going with the fanout.'` (extractLastUserText picked
up the trailing fresh text).
Local validation: pnpm test extensions/qa-lab/src/mock-openai-server.test.ts
(19/19 pass).
Refs #64227
* qa-lab: reject Anthropic streaming and empty model in messages mock
* qa-lab: tag mock request snapshots with a provider variant so parity runs can diff per provider
* Handle invalid Anthropic mock JSON
* fix: wire mock parity providers by model ref
* fix(qa): support Anthropic message streaming in mock parity lane
* qa-lab: record provider/model/mode in qa-suite-summary.json
Closes the 'summary cannot be label-verified' half of criterion 5 on the
GPT-5.4 parity completion gate in #64227.
Background: the parity gate in #64441 compares two qa-suite-summary.json
files and trusts whatever candidateLabel / baselineLabel the caller
passes. Today the summary JSON only contains { scenarios, counts }, so
nothing in the summary records which provider/model the run actually
used. If a maintainer swaps candidate and baseline summary paths in a
parity-report call, the verdict is silently mislabeled and nobody can
retroactively verify which run produced which summary.
Changes:
- Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt,
providerMode, primaryModel (+ provider and model splits),
alternateModel (+ provider and model splits), fastMode, concurrency,
scenarioIds (when explicitly filtered).
- Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary
JSON shape is unit-testable and the parity gate (and any future parity
wrapper) can import the exact same type rather than reverse-engineering
the JSON shape at runtime.
- Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so
--scenario-ids flags are recorded in the summary.
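The provider/model split in the run block behaves roughly like this sketch (splitModelRef is a hypothetical helper name; the null-on-malformed behavior is from the test list below):

```typescript
// Split a "provider/model" ref; leave both halves null when malformed.
function splitModelRef(ref: string): { provider: string | null; model: string | null } {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) {
    return { provider: null, model: null }; // malformed ref: leave splits null
  }
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}
```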
Unit tests added (src/suite.summary-json.test.ts, 5 cases):
- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata
This is additive: existing callers of qa-suite-summary.json continue to
see the same { scenarios, counts } shape, just with an extra run field.
No existing consumers of the JSON need to change.
The follow-up 'qa parity run' CLI wrapper (run the parity pack twice
against candidate + baseline, emit two labeled summaries in one command)
stacks cleanly on top of this change and will land as a separate PR
once #64441 and #64662 merge so the wrapper can call runQaParityReportCommand
directly.
Local validation:
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)
Refs #64227
Unblocks the final parity run for #64441 / #64662 by making summaries
self-describing.
* qa-lab: strengthen qa-suite-summary builder types and empty-array semantics
Addresses 4 loop-6 Copilot / codex-connector findings on PR #64689
(re-opened as #64789):
1. P2 codex + Copilot: empty `scenarioIds` array was serialized as
`[]` because of a truthiness check. The CLI passes an empty array
when --scenario is omitted, so full-suite runs would incorrectly
record an explicit empty selection. Fix: switch to a
`length > 0` check so '[] or undefined' both encode as `null`
in the summary run metadata.
2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate
consumers but its return type was `Record<string, unknown>`, which
defeated the point of exporting it. Fix: introduce a concrete
`QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and
make the builder return it. Downstream code (parity gate, parity
run wrapper) can now import the type and keep consumers
type-checked.
3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the
`'mock-openai' | 'live-frontier'` string union even though
`QaProviderMode` is already imported from model-selection.ts. Fix:
reuse `QaProviderMode` so provider-mode additions flow through
both types at once.
4. Copilot: test fixtures omitted `steps` from the fake scenario
results, creating shape drift with the real suite scenario-result
shape. Fix: pad the test fixtures with `steps: []` and tighten the
scenarioIds assertion to read `json.run.scenarioIds` directly (the
new concrete return type makes the type-cast unnecessary).
New regression: `treats an empty scenarioIds array as unspecified
(no filter)` — passes `scenarioIds: []` and asserts the summary
records `scenarioIds: null`.
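The length-based check from fix 1 is, in sketch form (the helper name is an assumption; the `length > 0` semantics are as described above):

```typescript
// '[] or undefined' both encode as null; only a non-empty filter is recorded.
function normalizeScenarioIds(scenarioIds: string[] | undefined): string[] | null {
  return scenarioIds && scenarioIds.length > 0 ? [...scenarioIds] : null;
}
```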
Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).
Refs #64227
* qa-lab: record executed scenarioIds in summary run metadata
Addresses the pass-3 codex-connector P2 on #64789 (repl of #64689):
`run.scenarioIds` was copied from the raw `params.scenarioIds`
caller input, but `runQaSuite` normalizes that input through
`selectQaSuiteScenarios` which dedupes via `Set` and reorders the
selection to catalog order. When callers repeat --scenario ids or
pass them in non-catalog order, the summary metadata drifted from
the scenarios actually executed, which can make parity/report
tooling treat equivalent runs as different or trust inaccurate
provenance.
Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass
`selectedCatalogScenarios.map(scenario => scenario.id)` instead of
`params?.scenarioIds`, so the summary records the post-selection
executed list. This also covers the full-suite case automatically
(the executed list is the full lane-filtered catalog), giving parity
consumers a stable record of exactly which scenarios landed in the
run regardless of how the caller phrased the request.
buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2
semantics are preserved so the public helper still treats an empty
array as 'unspecified' for any future caller that legitimately passes
one.
Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).
Refs #64227
* qa-lab: preserve null scenarioIds for unfiltered suite runs
Addresses the pass-4 codex-connector P2 on #64789: the pass-3 fix
always passed `selectedCatalogScenarios.map(...)` to
writeQaSuiteArtifacts, which made unfiltered full-suite runs
indistinguishable from an explicit all-scenarios selection in the
summary metadata. The 'unfiltered → null' semantic (documented in
the buildQaSuiteSummaryJson JSDoc and exercised by the
"treats an empty scenarioIds array as unspecified" regression) was
lost.
Fix: both writeQaSuiteArtifacts call sites now condition on the
caller's original `params.scenarioIds`. When the caller passed an
explicit non-empty filter, record the post-selection executed list
(pass-3 behavior, preserving Set-dedupe + catalog-order
normalization). When the caller passed undefined or an empty array,
pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's
length-check serializes null (pass-2 behavior, preserving unfiltered
semantics).
This keeps both codex-connector findings satisfied simultaneously:
- explicit --scenario filter reorders/dedupes through the executed
list, not the raw caller input
- unfiltered full-suite run records null, not a full catalog dump
that would shadow "explicit all-scenarios" selections
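The call-site condition, sketched (names are assumptions; the branch logic is as described above):

```typescript
// Explicit non-empty filter: record the deduped, catalog-ordered executed
// list. Unfiltered (undefined or []): record nothing so the summary
// serializes null downstream.
function scenarioIdsForSummary(
  callerIds: string[] | undefined,
  executedIds: string[],
): string[] | undefined {
  return callerIds && callerIds.length > 0 ? executedIds : undefined;
}
```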
Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).
Refs #64227
* qa-lab: reuse QaProviderMode in writeQaSuiteArtifacts param type
* qa-lab: stage mock auth profiles so the parity gate runs without real credentials
* fix(qa): clean up mock auth staging follow-ups
* ci: add parity-gate workflow that runs the GPT-5.4 vs Opus 4.6 gate end-to-end against the qa-lab mock
* ci: use supported parity gate runner label
* ci: watch gateway changes in parity gate
* docs: pin parity runbook alternate models
* fix(ci): watch qa-channel parity inputs
* qa: roll up parity proof closeout
* qa: harden mock parity review fixes
* qa-lab: fix review findings — comment wording, placeholder key, exported type, ordering assertion, remove false-positive positive-tone detection
* qa: fix memory-recall scenario count, update criterion 2 comment, cache fetchJson in model-switch
* qa-lab: clean up positive-tone comment + fix stale test expectations
* qa: pin workflow Node version to 22.14.0 + fix stale label-match wording
* qa-lab: refresh mock provider routing expectation
* docs: drop stale parity rollup rewrite from proof slice
* qa: run parity gate against mock lane
* deps: sync qa-lab lockfile
* build: refresh a2ui bundle hash
* ci: widen parity gate triggers
---------
Co-authored-by: Eva <eva@100yen.org>
startGatewayRuntimeServices() previously started both the cron
scheduler AND heartbeat runner BEFORE gateway sidecars finished
initialising. Because chat.history is marked unavailable until
sidecars complete, any cron job or heartbeat tick that called
chat.history during this window received a hard UNAVAILABLE error.
Fix: create a noop heartbeat placeholder in the early
startGatewayRuntimeServices() call, then activate the real
heartbeat runner, cron scheduler, and pending delivery recovery
in a new activateGatewayScheduledServices() function that runs
AFTER startGatewayPostAttachRuntime() completes.
channelHealthMonitor and model pricing refresh remain in the
early call since they do not depend on chat.history.
Root cause analysis by luban, cross-validated by tongluo.
Reviewer feedback addressed: heartbeat runner is now also
deferred (previously only cron was deferred).
* agents: auto-activate strict-agentic for GPT-5 and emit blocked-exit liveness
Closes two hard blockers on the GPT-5.4 parity completion gate:
1) Criterion 1 (no stalls after planning) is universal, but the pre-existing
strict-agentic execution contract was opt-in only. Out-of-the-box GPT-5
openai / openai-codex users who never set
`agents.defaults.embeddedPi.executionContract` still got only 1
planning-only retry and then fell through to the normal completion path
with the plan-only text, i.e. they still stalled.
Introduce `resolveEffectiveExecutionContract(...)` in
src/agents/execution-contract.ts. Behavior:
- supported provider/model (openai or openai-codex + gpt-5-family) AND
explicit "strict-agentic" or unspecified → "strict-agentic"
- supported provider/model AND explicit "default" → "default" (opt-out)
- unsupported provider/model → "default" regardless of explicit value
`isStrictAgenticExecutionContractActive` now delegates to the effective
resolver so the 2-retry + blocked-state treatment applies by default to
every GPT-5 openai/codex run. Explicit opt-out still works for users who
intentionally want the pre-parity-program behavior.
2) Criterion 4 (replay/liveness failures are explicit, not silent
disappearance) is violated by the strict-agentic blocked exit itself.
Every other terminal return path in src/agents/pi-embedded-runner/run.ts
sets `replayInvalid` + `livenessState` via `setTerminalLifecycleMeta`,
but the strict-agentic exit at run.ts:1615 falls through without them.
Add explicit `livenessState: "abandoned"` + `replayInvalid` (via the
shared `resolveReplayInvalidForAttempt` helper) to that exit, plus a
`setTerminalLifecycleMeta` call so downstream observers (lifecycle log,
ACP bridge, telemetry) see the same explicit terminal state they see on
every other exit branch.
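The resolution matrix from point 1 can be sketched like this (the GPT-5-family model predicate is a simplified assumption, not the real matcher):

```typescript
type ExecutionContract = "strict-agentic" | "default";

function resolveEffectiveExecutionContract(
  provider: string,
  model: string,
  configured: ExecutionContract | undefined,
): ExecutionContract {
  const supportedProvider = provider === "openai" || provider === "openai-codex";
  const supportedModel = /gpt-5/.test(model); // simplified family check
  if (!supportedProvider || !supportedModel) {
    return "default"; // unsupported lanes collapse regardless of config
  }
  // Supported lane: unspecified or explicit strict-agentic both activate;
  // explicit "default" remains an opt-out.
  return configured === "default" ? "default" : "strict-agentic";
}
```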
Regressions added:
- `auto-enables update_plan for unconfigured GPT-5 openai runs`
- `respects explicit default contract opt-out on GPT-5 runs`
- `does not auto-enable update_plan for non-openai providers even when unconfigured`
- `emits explicit replayInvalid + abandoned liveness state at the strict-agentic blocked exit`
- `auto-activates strict-agentic for unconfigured GPT-5 openai runs and surfaces the blocked state`
- `respects explicit default contract opt-out on GPT-5 openai runs`
Local validation:
- pnpm test src/agents/openclaw-tools.update-plan.test.ts src/agents/pi-embedded-runner/run.incomplete-turn.test.ts src/agents/pi-embedded-runner.buildembeddedsandboxinfo.test.ts src/agents/system-prompt.test.ts src/agents/openclaw-tools.sessions.test.ts src/agents/pi-embedded-runner/run.overflow-compaction.test.ts
122/122 passing.
Refs #64227
* agents: address loop-6 review comments on strict-agentic contract
Triages all three loop-6 review comments on PR #64679:
1. Copilot: 'The strict-agentic blocked exit returns an error payload
(isError: true) but sets livenessState to "abandoned". Elsewhere in
the runner/lifecycle flow, error terminal states are treated as
"blocked".' Verified: every other hardcoded error terminal branch in
run.ts (role ordering at 1152, image size at 1206, schema error at
1244, compaction timeout at 1128, aborted-with-no-payloads at 606)
uses livenessState: "blocked". Match that convention at the
strict-agentic blocked exit at 1634. Updated the 'emits explicit
replayInvalid + abandoned liveness state' regression test to assert
the new "blocked" value and renamed the assertion commentary.
2. Copilot: 'The JSDoc for resolveEffectiveExecutionContract says
explicit "strict-agentic" in config always resolves to
"strict-agentic", but the implementation collapses to "default"
whenever the provider/mode is unsupported.' Rewrite the JSDoc to
explicitly document the unsupported-provider collapse as the lead
case (strict-agentic is a GPT-5-family openai/openai-codex-only
runtime contract) before listing the supported-lane behavior matrix.
No code change; this is a docstring-only clarification.
3. Greptile P2: 'Non-preferred Anthropic model constant. CLAUDE.md says
to prefer sonnet-4.6 for Anthropic test constants.' Swap
claude-opus-4-6 → claude-sonnet-4-6 in the two update_plan gating
fixtures that assert non-openai providers don't auto-enable the
planning tool. Behavior unchanged; model constant now matches repo
testing guidance.
Local validation:
- pnpm test src/agents/openclaw-tools.update-plan.test.ts src/agents/pi-embedded-runner/run.incomplete-turn.test.ts
29/29 passing.
Refs #64227
* test: rename strict-agentic blocked-exit liveness regression to match blocked state
Addresses loop-7 Copilot finding on PR #64679: loop 6 changed the
assertion to livenessState === 'blocked' to match the rest of the
hard-error terminal branches in run.ts, but the test title still said
'abandoned liveness state', which made failures and test output
misleading. Rename the test title to match the asserted value. No
code change beyond the it(...) title.
Validation: pnpm test src/agents/pi-embedded-runner/run.incomplete-turn.test.ts
(19/19 pass).
Refs #64227
* agents: widen strict-agentic auto-activation to handle prefixed and variant GPT-5 model ids
* Align strict-agentic retry matching
* runtime: harden strict-agentic model matching
---------
Co-authored-by: Eva <eva@100yen.org>