pashpashpash b13844732e qa: salvage GPT-5.4 parity proof slice (#65664)
* test(qa): gate parity prose scenarios on real tool calls

Closes criterion 2 of the GPT-5.4 parity completion gate in #64227 ('no
fake progress / fake tool completion') for the two first/second-wave
parity scenarios that can currently pass with a prose-only reply.

Background: the scenario framework already exposes tool-call assertions
via /debug/requests on the mock server (see approval-turn-tool-followthrough
for the pattern). Most parity scenarios use this seam to require a specific
plannedToolName, but source-docs-discovery-report and subagent-handoff
only checked the assistant's prose text, which means a model could fabricate:

- a Worked / Failed / Blocked / Follow-up report without ever calling
  the read tool on the docs / source files the prompt named
- three labeled 'Delegated task', 'Result', 'Evidence' sections without
  ever calling sessions_spawn to delegate

Both gaps are fake-progress loopholes for the parity gate.

Changes:

- source-docs-discovery-report: require at least one read tool call tied
  to the 'worked, failed, blocked' prompt in /debug/requests. Failure
  message dumps the observed plannedToolName list for debugging.
- subagent-handoff: require at least one sessions_spawn tool call tied
  to the 'delegate' / 'subagent handoff' prompt in /debug/requests. Same
  debug-friendly failure message.

Both assertions are gated behind !env.mock so they no-op in live-frontier
mode where the real provider exposes plannedToolName through a different
channel (or not at all).

Not touched: memory-recall is also in the parity pack but its pass path
is legitimately 'read the fact from prior-turn context'. That is a valid
recall strategy, not fake progress, so it is out of scope for this PR.
memory-recall's fake-progress story (no real memory_search call) would
require bigger mock-server changes and belongs in a follow-up that
extends the mock memory pipeline.

Validation:

- pnpm test extensions/qa-lab/src/scenario-catalog.test.ts

Refs #64227

* test(qa): fix case-sensitive tool-call assertions and dedupe debug fetch

Addresses loop-6 review feedback on PR #64681:

1. Copilot / Greptile / codex-connector all flagged that the discovery
   scenario's .includes('worked, failed, blocked') assertion is
   case-sensitive but the real prompt says 'Worked, Failed, Blocked...',
   so the mock-mode assertion never matches. Fix: lowercase-normalize
   allInputText before the contains check.
2. Greptile P2: the expr and message.expr each called fetchJson
   separately, incurring two round-trips to /debug/requests. Fix: hoist
   the fetch to a set step (discoveryDebugRequests / subagentDebugRequests)
   and reuse the snapshot.
3. Copilot: the subagent-handoff assertion scanned the entire request
   log and matched the first request with 'delegate' in its input text,
   which could false-pass on a stale prior scenario. Fix: reverse the
   array and take the most recent matching request instead.
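In standalone form, the discovery check after fixes 1 and 2 reads roughly as below. This is illustrative only: the real assertion lives in the scenario's set/expr steps, and the DebugRequest field names are taken from the descriptions above, not from an actual exported type.

  // Hypothetical standalone form of the scenario assertion.
  type DebugRequest = { plannedToolName?: string; allInputText: string; toolOutput?: string };

  async function assertDiscoveryReadCall(
    env: { mock: boolean },
    fetchJson: (url: string) => Promise<DebugRequest[]>,
  ): Promise<void> {
    if (!env.mock) return; // live-frontier mode exposes plannedToolName differently (or not at all)
    // Fetch the snapshot once and reuse it for both the check and the failure message.
    const requests = await fetchJson('/debug/requests');
    const matched = requests.some(request =>
      request.plannedToolName === 'read' &&
      // Lowercase inline: the real prompt reads 'Worked, Failed, Blocked...'.
      request.allInputText.toLowerCase().includes('worked, failed, blocked'));
    if (!matched) {
      throw new Error('expected a read tool call for the discovery prompt; saw plannedToolName: ' +
        requests.map(request => request.plannedToolName ?? '<none>').join(', '));
    }
  }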

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): narrow subagent-handoff tool-call assertion to pre-tool requests

Pass-2 codex-connector P1 finding on #64681: the reverse-find pattern I
used on pass 1 usually lands on the FOLLOW-UP request after the mock
runs sessions_spawn, not the pre-tool planning request that actually
has plannedToolName === 'sessions_spawn'. The mock only plans that tool
on requests with !toolOutput (mock-openai-server.ts:662), so the
post-tool request has plannedToolName unset and the assertion fails
even when the handoff succeeded.

Fix: switch the assertion back to a forward .some() match but add a
!request.toolOutput filter so the match is pinned to the pre-tool
planning phase. The case-insensitive regex, the fetchJson dedupe, and
the failure-message diagnostic from pass 1 are unchanged.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): pin subagent-handoff tool-call assertion to scenario prompt

Addresses the pass-3 codex-connector P1 on #64681: the pass-2 fix
filtered to pre-tool requests but still used a broad
`/delegate|subagent handoff/i` regex. The `subagent-fanout-synthesis`
scenario runs BEFORE `subagent-handoff` in catalog order (scenarios
are sorted by path), and the fanout prompt reads
'Subagent fanout synthesis check: delegate exactly two bounded
subagents sequentially' — which contains 'delegate' and also plans
sessions_spawn pre-tool. That produces a cross-scenario false pass
where the fanout's earlier sessions_spawn request satisfies the
handoff assertion even when the handoff run never delegates.

Fix: tighten the input-text match from `/delegate|subagent handoff/i`
to `/delegate one bounded qa task/i`, which is the exact scenario-
unique substring from the `subagent-handoff` config.prompt. That
pins the assertion to this scenario's request window and closes the
cross-scenario false positive.
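In sketch form (same hypothetical DebugRequest shape as the earlier sketch), the handoff assertion now reads roughly:

  const handoffDelegated = subagentDebugRequests.some(request =>
    // Pin to the pre-tool planning phase: the mock only sets plannedToolName
    // on requests without toolOutput (mock-openai-server.ts:662).
    !request.toolOutput &&
    request.plannedToolName === 'sessions_spawn' &&
    // Scenario-unique prompt substring, so the earlier subagent-fanout-synthesis
    // run's sessions_spawn request can no longer satisfy the check.
    /delegate one bounded qa task/i.test(request.allInputText));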

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): align parity assertion comments with actual filter logic

Addresses two loop-7 Copilot findings on PR #64681:

1. source-docs-discovery-report.md: the explanatory comment said the
   debug request log was 'lowercased for case-insensitive matching',
   but the code actually lowercases each request's allInputText inline
   inside the .some() predicate, not the discoveryDebugRequests
   snapshot. Rewrite the comment to describe the inline-lowercase
   pattern so a future reader matches the code they see.

2. subagent-handoff.md: the comment said the assertion 'must be
   pinned to THIS scenario's request window' but the implementation
   actually relies on matching a scenario-unique prompt substring
   (/delegate one bounded qa task/i), not a request-window. Rewrite
   the comment to describe the substring pinning and keep the
   pre-tool filter rationale intact.

No runtime change; comment-only fix to keep reviewer expectations
aligned with the actual assertion shape.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): extend tool-call assertions to image-understanding, subagent-fanout, and capability-flip scenarios

* Guard mock-only image parity assertions

* Expand agentic parity second wave

* test(qa): pad parity suspicious-pass isolation to second wave

* qa-lab: parametrize parity report title and drop stale first-wave comment

Addresses two loop-7 Copilot findings on PR #64662:

1. Hard-coded 'GPT-5.4 / Opus 4.6' markdown H1: the renderer now uses a
   template string that interpolates candidateLabel and baselineLabel, so
   any parity run (not only gpt-5.4 vs opus 4.6) renders an accurate
   title in saved reports. Default CLI flags still produce
   openai/gpt-5.4 vs anthropic/claude-opus-4-6 as the baseline pair.

2. Stale 'declared first-wave parity scenarios' comment in
   scopeSummaryToParityPack: the parity pack is now the ten-scenario
   first-wave+second-wave set (PR D + PR E). Comment updated to drop
   the first-wave qualifier and name the full QA_AGENTIC_PARITY_SCENARIOS
   constant the scope is filtering against.

New regression: 'parametrizes the markdown header from the comparison
labels' — asserts that non-default labels (openai/gpt-5.4-alt vs
openai/gpt-5.4) render in the H1.

Validation: pnpm test extensions/qa-lab/src/agentic-parity-report.test.ts
(13/13 pass).

Refs #64227

* qa-lab: fail parity gate on required scenario failures regardless of baseline parity

* test(qa): update readable-report test to cover all 10 parity scenarios

* qa-lab: strengthen parity-report fake-success detector and verify run.primaryProvider labels

* Tighten parity label and scenario checks

* fix: tighten parity label provenance checks

* fix: scope parity tool-call metrics to tool lanes

* Fix parity report label and fake-success checks

* fix(qa): tighten parity report edge cases

* qa-lab: add Anthropic /v1/messages mock route for parity baseline

Closes the last local-runnability gap on criterion 5 of the GPT-5.4 parity
completion gate in #64227 ('the parity gate shows GPT-5.4 matches or beats
Opus 4.6 on the agreed metrics').

Background: the parity gate needs two comparable scenario runs - one
against openai/gpt-5.4 and one against anthropic/claude-opus-4-6 - so the
aggregate metrics and verdict in PR D (#64441) can be computed. Today the
qa-lab mock server only implements /v1/responses, so the baseline run
against Claude Opus 4.6 requires a real Anthropic API key. That makes the
gate impossible to prove end-to-end from a local worktree and means the
CI story is always 'two real providers + quota + keys'.

This PR adds a /v1/messages Anthropic-compatible route to the existing
mock OpenAI server. The route is a thin adapter that:

- Parses Anthropic Messages API request shapes (system as string or
  [{type:text,text}], messages with string or block content, text and
  tool_result and tool_use and image blocks)
- Translates them into the ResponsesInputItem[] shape the existing shared
  scenario dispatcher (buildResponsesPayload) already understands
- Calls the shared dispatcher so both the OpenAI and Anthropic lanes run
  through the exact same scenario prompt-matching logic (same subagent
  fanout state machine, same extractRememberedFact helper, same
  '/debug/requests' telemetry)
- Converts the resulting OpenAI-format events back into an Anthropic
  message response with text and tool_use content blocks and a correct
  stop_reason (tool_use vs end_turn)
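On the response side, a minimal illustration of the block assembly and stop_reason rule (Anthropic block shapes as documented; the helper name and parameters are invented for the sketch):

  // Text becomes a text block, a planned tool call becomes a tool_use block,
  // and stop_reason reflects whether a tool call is pending.
  function toAnthropicMessage(text: string, toolCall?: { id: string; name: string; input: unknown }) {
    const content: Array<Record<string, unknown>> = [];
    if (text) content.push({ type: 'text', text });
    if (toolCall) content.push({ type: 'tool_use', id: toolCall.id, name: toolCall.name, input: toolCall.input });
    return { role: 'assistant', content, stop_reason: toolCall ? 'tool_use' : 'end_turn' };
  }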

Non-streaming only: the QA suite runner falls back to non-streaming mock
mode so real Anthropic SSE isn't necessary for the parity baseline.

Also adds claude-opus-4-6 and claude-sonnet-4-6 to /v1/models so baseline
model-list probes from the suite runner resolve without extra config.

Tests added:

- advertises Anthropic claude-opus-4-6 baseline model on /v1/models
- dispatches an Anthropic /v1/messages read tool call for source discovery
  prompts (tool_use stop_reason, correct input path, /debug/requests
  records plannedToolName=read)
- dispatches Anthropic /v1/messages tool_result follow-ups through the
  shared scenario logic (subagent-handoff two-stage flow: tool_use -
  tool_result - 'Delegated task / Evidence' prose summary)

Local validation:

- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (18/18 pass)
- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (47/47 pass)

Refs #64227
Unblocks #64441 (parity harness) and the forthcoming qa parity run wrapper
by giving the baseline lane a local-only mock path.

* qa-lab: fix Anthropic tool_result ordering in messages adapter

Addresses the loop-6 Copilot / Greptile finding on PR #64685: in
`convertAnthropicMessagesToResponsesInput`, `tool_result` blocks were
pushed to `items` inside the per-block loop while the surrounding
user/assistant message was only pushed after the loop finished. That
reordered the function_call_output BEFORE its parent user message
whenever a user turn mixed `tool_result` with fresh text/image blocks,
which broke `extractToolOutput` (it scans AFTER the last user-role
index; function_call_output placed BEFORE that index is invisible to it)
and made the downstream scenario dispatcher behave as if no tool output
had been returned on mixed-content turns.

Fix: buffer `tool_result` and `tool_use` blocks in local arrays during
the per-block loop, push the parent role message first (when it has any
text/image pieces), then push the accumulated function_call /
function_call_output items in original order. tool_result-only user
turns still omit the parent message as before, so the non-mixed
subagent-fanout-synthesis two-stage flow that already worked keeps
working.
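A sketch of the fixed per-message conversion (local names are invented for the sketch, not the adapter's actual identifiers; image blocks omitted for brevity):

  type AnthropicBlock =
    | { type: 'text'; text: string }
    | { type: 'tool_use'; id: string; name: string; input: unknown }
    | { type: 'tool_result'; tool_use_id: string; content: string };

  type ResponsesInputItem =
    | { type: 'message'; role: string; content: string }
    | { type: 'function_call'; call_id: string; name: string; arguments: string }
    | { type: 'function_call_output'; call_id: string; output: string };

  function convertMessage(role: 'user' | 'assistant', blocks: AnthropicBlock[]): ResponsesInputItem[] {
    const textPieces: string[] = [];
    const toolItems: ResponsesInputItem[] = [];
    for (const block of blocks) {
      if (block.type === 'text') {
        textPieces.push(block.text);
      } else if (block.type === 'tool_use') {
        toolItems.push({ type: 'function_call', call_id: block.id, name: block.name, arguments: JSON.stringify(block.input) });
      } else if (block.type === 'tool_result') {
        toolItems.push({ type: 'function_call_output', call_id: block.tool_use_id, output: block.content });
      }
    }
    const items: ResponsesInputItem[] = [];
    // Parent message first (tool_result-only turns still omit it, as before)...
    if (textPieces.length > 0) items.push({ type: 'message', role, content: textPieces.join('\n') });
    // ...then the buffered tool items in their original order, so a
    // function_call_output can never precede its parent user message.
    items.push(...toolItems);
    return items;
  }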

Regression added:

- `places tool_result after the parent user message even in mixed-content
  turns` — sends a user turn that mixes a `tool_result` block with a
  trailing fresh text block, then inspects `/debug/last-request` to
  assert that `toolOutput === 'SUBAGENT-OK'` (extractToolOutput found
  the function_call_output AFTER the last user index) and
  `prompt === 'Keep going with the fanout.'` (extractLastUserText picked
  up the trailing fresh text).

Local validation: pnpm test extensions/qa-lab/src/mock-openai-server.test.ts
(19/19 pass).

Refs #64227

* qa-lab: reject Anthropic streaming and empty model in messages mock

* qa-lab: tag mock request snapshots with a provider variant so parity runs can diff per provider

* Handle invalid Anthropic mock JSON

* fix: wire mock parity providers by model ref

* fix(qa): support Anthropic message streaming in mock parity lane

* qa-lab: record provider/model/mode in qa-suite-summary.json

Closes the 'summary cannot be label-verified' half of criterion 5 on the
GPT-5.4 parity completion gate in #64227.

Background: the parity gate in #64441 compares two qa-suite-summary.json
files and trusts whatever candidateLabel / baselineLabel the caller
passes. Today the summary JSON only contains { scenarios, counts }, so
nothing in the summary records which provider/model the run actually
used. If a maintainer swaps candidate and baseline summary paths in a
parity-report call, the verdict is silently mislabeled and nobody can
retroactively verify which run produced which summary.

Changes:

- Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt,
  providerMode, primaryModel (+ provider and model splits),
  alternateModel (+ provider and model splits), fastMode, concurrency,
  scenarioIds (when explicitly filtered).
- Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary
  JSON shape is unit-testable and the parity gate (and any future parity
  wrapper) can import the exact same type rather than reverse-engineering
  the JSON shape at runtime.
- Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so
  --scenario-ids flags are recorded in the summary.
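Roughly, the resulting summary shape as a type (the run fields are the ones listed above; the split-field naming is an assumption and only the provider split is shown):

  type QaProviderMode = 'mock-openai' | 'live-frontier';

  interface QaSuiteSummaryJson {
    run: {
      startedAt: string;
      finishedAt: string;
      providerMode: QaProviderMode;
      primaryModel: string;            // e.g. 'openai/gpt-5.4'
      primaryProvider: string | null;  // split from the model ref; null when malformed
      alternateModel: string | null;
      fastMode: boolean;
      concurrency: number;
      scenarioIds: string[] | null;    // null when the run was not explicitly filtered
    };
    scenarios: unknown[];              // pre-existing field, shape unchanged
    counts: Record<string, number>;    // pre-existing field, shown loosely
  }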

Unit tests added (src/suite.summary-json.test.ts, 5 cases):

- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata

This is additive: existing callers of qa-suite-summary.json continue to
see the same { scenarios, counts } shape, just with an extra run field.
No existing consumers of the JSON need to change.

The follow-up 'qa parity run' CLI wrapper (run the parity pack twice
against candidate + baseline, emit two labeled summaries in one command)
stacks cleanly on top of this change and will land as a separate PR
once #64441 and #64662 merge so the wrapper can call runQaParityReportCommand
directly.

Local validation:

- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)

Refs #64227
Unblocks the final parity run for #64441 / #64662 by making summaries
self-describing.

* qa-lab: strengthen qa-suite-summary builder types and empty-array semantics

Addresses 4 loop-6 Copilot / codex-connector findings on PR #64689
(re-opened as #64789):

1. P2 codex + Copilot: empty `scenarioIds` array was serialized as
   `[]` because of a truthiness check. The CLI passes an empty array
   when --scenario is omitted, so full-suite runs would incorrectly
   record an explicit empty selection. Fix: switch to a
   `length > 0` check so '[] or undefined' both encode as `null`
   in the summary run metadata.

2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate
   consumers but its return type was `Record<string, unknown>`, which
   defeated the point of exporting it. Fix: introduce a concrete
   `QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and
   make the builder return it. Downstream code (parity gate, parity
   run wrapper) can now import the type and keep consumers
   type-checked.

3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the
   `'mock-openai' | 'live-frontier'` string union even though
   `QaProviderMode` is already imported from model-selection.ts. Fix:
   reuse `QaProviderMode` so provider-mode additions flow through
   both types at once.

4. Copilot: test fixtures omitted `steps` from the fake scenario
   results, creating shape drift with the real suite scenario-result
   shape. Fix: pad the test fixtures with `steps: []` and tighten the
   scenarioIds assertion to read `json.run.scenarioIds` directly (the
   new concrete return type makes the type-cast unnecessary).

New regression: `treats an empty scenarioIds array as unspecified
(no filter)` — passes `scenarioIds: []` and asserts the summary
records `scenarioIds: null`.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs #64227

* qa-lab: record executed scenarioIds in summary run metadata

Addresses the pass-3 codex-connector P2 on #64789 (reopened from #64689):
`run.scenarioIds` was copied from the raw `params.scenarioIds`
caller input, but `runQaSuite` normalizes that input through
`selectQaSuiteScenarios` which dedupes via `Set` and reorders the
selection to catalog order. When callers repeat --scenario ids or
pass them in non-catalog order, the summary metadata drifted from
the scenarios actually executed, which can make parity/report
tooling treat equivalent runs as different or trust inaccurate
provenance.

Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass
`selectedCatalogScenarios.map(scenario => scenario.id)` instead of
`params?.scenarioIds`, so the summary records the post-selection
executed list. This also covers the full-suite case automatically
(the executed list is the full lane-filtered catalog), giving parity
consumers a stable record of exactly which scenarios landed in the
run regardless of how the caller phrased the request.

buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2
semantics are preserved so the public helper still treats an empty
array as 'unspecified' for any future caller that legitimately passes
one.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs #64227

* qa-lab: preserve null scenarioIds for unfiltered suite runs

Addresses the pass-4 codex-connector P2 on #64789: the pass-3 fix
always passed `selectedCatalogScenarios.map(...)` to
writeQaSuiteArtifacts, which made unfiltered full-suite runs
indistinguishable from an explicit all-scenarios selection in the
summary metadata. The 'unfiltered → null' semantic (documented in
the buildQaSuiteSummaryJson JSDoc and exercised by the
"treats an empty scenarioIds array as unspecified" regression) was
lost.

Fix: both writeQaSuiteArtifacts call sites now condition on the
caller's original `params.scenarioIds`. When the caller passed an
explicit non-empty filter, record the post-selection executed list
(pass-3 behavior, preserving Set-dedupe + catalog-order
normalization). When the caller passed undefined or an empty array,
pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's
length-check serializes null (pass-2 behavior, preserving unfiltered
semantics).
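In sketch form (identifiers from the description above; surrounding artifact wiring elided):

  const explicitFilter = params?.scenarioIds !== undefined && params.scenarioIds.length > 0;
  // Explicit non-empty filter: record the deduped, catalog-ordered executed list.
  // Unfiltered (undefined or []): pass undefined so the summary serializes null.
  // The result is passed as scenarioIds to both writeQaSuiteArtifacts call sites.
  const scenarioIdsForSummary = explicitFilter
    ? selectedCatalogScenarios.map(scenario => scenario.id)
    : undefined;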

This keeps both codex-connector findings satisfied simultaneously:
- explicit --scenario filter reorders/dedupes through the executed
  list, not the raw caller input
- unfiltered full-suite run records null, not a full catalog dump
  that would shadow "explicit all-scenarios" selections

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs #64227

* qa-lab: reuse QaProviderMode in writeQaSuiteArtifacts param type

* qa-lab: stage mock auth profiles so the parity gate runs without real credentials

* fix(qa): clean up mock auth staging follow-ups

* ci: add parity-gate workflow that runs the GPT-5.4 vs Opus 4.6 gate end-to-end against the qa-lab mock

* ci: use supported parity gate runner label

* ci: watch gateway changes in parity gate

* docs: pin parity runbook alternate models

* fix(ci): watch qa-channel parity inputs

* qa: roll up parity proof closeout

* qa: harden mock parity review fixes

* qa-lab: fix review findings — comment wording, placeholder key, exported type, ordering assertion, remove false-positive positive-tone detection

* qa: fix memory-recall scenario count, update criterion 2 comment, cache fetchJson in model-switch

* qa-lab: clean up positive-tone comment + fix stale test expectations

* qa: pin workflow Node version to 22.14.0 + fix stale label-match wording

* qa-lab: refresh mock provider routing expectation

* docs: drop stale parity rollup rewrite from proof slice

* qa: run parity gate against mock lane

* deps: sync qa-lab lockfile

* build: refresh a2ui bundle hash

* ci: widen parity gate triggers

---------

Co-authored-by: Eva <eva@100yen.org>

🦞 OpenClaw — Personal AI Assistant


EXFOLIATE! EXFOLIATE!


OpenClaw is a personal AI assistant you run on your own devices. It answers you on the channels you already use (WhatsApp, Telegram, Slack, Discord, Google Chat, Signal, iMessage, BlueBubbles, IRC, Microsoft Teams, Matrix, Feishu, LINE, Mattermost, Nextcloud Talk, Nostr, Synology Chat, Tlon, Twitch, Zalo, Zalo Personal, WeChat, WebChat). It can speak and listen on macOS/iOS/Android, and can render a live Canvas you control. The Gateway is just the control plane — the product is the assistant.

If you want a personal, single-user assistant that feels local, fast, and always-on, this is it.

Website · Docs · Vision · DeepWiki · Getting Started · Updating · Showcase · FAQ · Onboarding · Nix · Docker · Discord

Preferred setup: run openclaw onboard in your terminal. OpenClaw Onboard guides you step by step through setting up the gateway, workspace, channels, and skills. It is the recommended CLI setup path and works on macOS, Linux, and Windows (via WSL2; strongly recommended). Works with npm, pnpm, or bun. New install? Start here: Getting started

Sponsors

OpenAI GitHub NVIDIA Vercel Blacksmith Convex

Subscriptions (OAuth):

Model note: while many providers and models are supported, prefer a current flagship model from the provider you trust and already use. See Onboarding.

Models (selection + auth)

Runtime: Node 24 (recommended) or Node 22.16+.

npm install -g openclaw@latest
# or: pnpm add -g openclaw@latest

openclaw onboard --install-daemon

OpenClaw Onboard installs the Gateway daemon (launchd/systemd user service) so it stays running.

Quick start (TL;DR)

Runtime: Node 24 (recommended) or Node 22.16+.

Full beginner guide (auth, pairing, channels): Getting started

openclaw onboard --install-daemon

openclaw gateway --port 18789 --verbose

# Send a message
openclaw message send --to +1234567890 --message "Hello from OpenClaw"

# Talk to the assistant (optionally deliver back to any connected channel: WhatsApp/Telegram/Slack/Discord/Google Chat/Signal/iMessage/BlueBubbles/IRC/Microsoft Teams/Matrix/Feishu/LINE/Mattermost/Nextcloud Talk/Nostr/Synology Chat/Tlon/Twitch/Zalo/Zalo Personal/WeChat/WebChat)
openclaw agent --message "Ship checklist" --thinking high

Upgrading? Updating guide (and run openclaw doctor).

Development channels

  • stable: tagged releases (vYYYY.M.D or vYYYY.M.D-<patch>), npm dist-tag latest.
  • beta: prerelease tags (vYYYY.M.D-beta.N), npm dist-tag beta (macOS app may be missing).
  • dev: moving head of main, npm dist-tag dev (when published).

Switch channels (git + npm): openclaw update --channel stable|beta|dev. Details: Development channels.

From source (development)

Prefer pnpm for builds from source. Bun is optional for running TypeScript directly.

git clone https://github.com/openclaw/openclaw.git
cd openclaw

pnpm install
pnpm ui:build # auto-installs UI deps on first run
pnpm build

pnpm openclaw onboard --install-daemon

# Dev loop (auto-reload on source/config changes)
pnpm gateway:watch

Note: pnpm openclaw ... runs TypeScript directly (via tsx). pnpm build produces dist/ for running via Node / the packaged openclaw binary.

Security defaults (DM access)

OpenClaw connects to real messaging surfaces. Treat inbound DMs as untrusted input.

Full security guide: Security

Default behavior on Telegram/WhatsApp/Signal/iMessage/Microsoft Teams/Discord/Google Chat/Slack:

  • DM pairing (dmPolicy="pairing" / channels.discord.dmPolicy="pairing" / channels.slack.dmPolicy="pairing"; legacy: channels.discord.dm.policy, channels.slack.dm.policy): unknown senders receive a short pairing code and the bot does not process their message.
  • Approve with: openclaw pairing approve <channel> <code> (then the sender is added to a local allowlist store).
  • Public inbound DMs require an explicit opt-in: set dmPolicy="open" and include "*" in the channel allowlist (allowFrom / channels.discord.allowFrom / channels.slack.allowFrom; legacy: channels.discord.dm.allowFrom, channels.slack.dm.allowFrom).

Run openclaw doctor to surface risky/misconfigured DM policies.
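For example, a minimal openclaw.json fragment that keeps Discord DMs on pairing (same shape as the channel examples further down):

{
  channels: {
    discord: {
      dmPolicy: "pairing",
    },
  },
}

Switching to dmPolicy: "open" additionally requires "*" in the channel's allowFrom list.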

Highlights

  • Local-first Gateway — single control plane for sessions, channels, tools, and events.
  • Multi-channel inbox — WhatsApp, Telegram, Slack, Discord, Google Chat, Signal, BlueBubbles (iMessage), iMessage (legacy), IRC, Microsoft Teams, Matrix, Feishu, LINE, Mattermost, Nextcloud Talk, Nostr, Synology Chat, Tlon, Twitch, Zalo, Zalo Personal, WeChat, WebChat, macOS, iOS/Android.
  • Multi-agent routing — route inbound channels/accounts/peers to isolated agents (workspaces + per-agent sessions).
  • Voice Wake + Talk Mode — wake words on macOS/iOS and continuous voice on Android (ElevenLabs + system TTS fallback).
  • Live Canvas — agent-driven visual workspace with A2UI.
  • First-class tools — browser, canvas, nodes, cron, sessions, and Discord/Slack actions.
  • Companion apps — macOS menu bar app + iOS/Android nodes.
  • Onboarding + skills — onboarding-driven setup with bundled/managed/workspace skills.

Star History


Everything we built so far

Core platform

Channels

Apps + nodes

Tools + automation

Runtime + safety

Ops + packaging

How it works (short)

WhatsApp / Telegram / Slack / Discord / Google Chat / Signal / iMessage / BlueBubbles / IRC / Microsoft Teams / Matrix / Feishu / LINE / Mattermost / Nextcloud Talk / Nostr / Synology Chat / Tlon / Twitch / Zalo / Zalo Personal / WeChat / WebChat
               │
               ▼
┌───────────────────────────────┐
│            Gateway            │
│       (control plane)         │
│     ws://127.0.0.1:18789      │
└──────────────┬────────────────┘
               │
               ├─ Pi agent (RPC)
               ├─ CLI (openclaw …)
               ├─ WebChat UI
               ├─ macOS app
               └─ iOS / Android nodes

Key subsystems

Tailscale access (Gateway dashboard)

OpenClaw can auto-configure Tailscale Serve (tailnet-only) or Funnel (public) while the Gateway stays bound to loopback. Configure gateway.tailscale.mode:

  • off: no Tailscale automation (default).
  • serve: tailnet-only HTTPS via tailscale serve (uses Tailscale identity headers by default).
  • funnel: public HTTPS via tailscale funnel (requires shared password auth).

Notes:

  • gateway.bind must stay loopback when Serve/Funnel is enabled (OpenClaw enforces this).
  • Serve can be forced to require a password by setting gateway.auth.mode: "password" or gateway.auth.allowTailscale: false.
  • Funnel refuses to start unless gateway.auth.mode: "password" is set.
  • Optional: gateway.tailscale.resetOnExit to undo Serve/Funnel on shutdown.
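A minimal fragment for tailnet-only Serve with password auth forced (keys as documented above):

{
  gateway: {
    tailscale: {
      mode: "serve",
      resetOnExit: true,
    },
    auth: {
      mode: "password",
    },
  },
}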

Details: Tailscale guide · Web surfaces

Remote Gateway (Linux is great)

It's perfectly fine to run the Gateway on a small Linux instance. Clients (macOS app, CLI, WebChat) can connect over Tailscale Serve/Funnel or SSH tunnels, and you can still pair device nodes (macOS/iOS/Android) to execute device-local actions when needed.

  • Gateway host runs the exec tool and channel connections by default.
  • Device nodes run device-local actions (system.run, camera, screen recording, notifications) via node.invoke. In short: exec runs where the Gateway lives; device actions run where the device lives.

Details: Remote access · Nodes · Security

macOS permissions via the Gateway protocol

The macOS app can run in node mode and advertises its capabilities + permission map over the Gateway WebSocket (node.list / node.describe). Clients can then execute local actions via node.invoke:

  • system.run runs a local command and returns stdout/stderr/exit code; set needsScreenRecording: true to require screen-recording permission (otherwise you'll get PERMISSION_MISSING).
  • system.notify posts a user notification and fails if notifications are denied.
  • canvas.*, camera.*, screen.record, and location.get are also routed via node.invoke and follow TCC permission status.

Elevated bash (host permissions) is separate from macOS TCC:

  • Use /elevated on|off to toggle per-session elevated access when enabled + allowlisted.
  • Gateway persists the per-session toggle via sessions.patch (WS method) alongside thinkingLevel, verboseLevel, model, sendPolicy, and groupActivation.

Details: Nodes · macOS app · Gateway protocol

Agent to Agent (sessions_* tools)

  • Use these to coordinate work across sessions without jumping between chat surfaces.
  • sessions_list — discover active sessions (agents) and their metadata.
  • sessions_history — fetch transcript logs for a session.
  • sessions_send — message another session; optional reply-back ping-pong + announce step (REPLY_SKIP, ANNOUNCE_SKIP).

Details: Session tools

Skills registry (ClawHub)

ClawHub is a minimal skill registry. With ClawHub enabled, the agent can search for skills automatically and pull in new ones as needed.

ClawHub

Chat commands

Send these in WhatsApp/Telegram/Slack/Google Chat/Microsoft Teams/WebChat (group commands are owner-only):

  • /status — compact session status (model + tokens, cost when available)
  • /new or /reset — reset the session
  • /compact — compact session context (summary)
  • /think <level> — off|minimal|low|medium|high|xhigh (GPT-5.2 + Codex models only)
  • /verbose on|off
  • /trace on|off — plugin trace/debug lines only
  • /usage off|tokens|full — per-response usage footer
  • /restart — restart the gateway (owner-only in groups)
  • /activation mention|always — group activation toggle (groups only)

Apps (optional)

The Gateway alone delivers a great experience. All apps are optional and add extra features.

If you plan to build/run companion apps, follow the platform runbooks below.

macOS (OpenClaw.app) (optional)

  • Menu bar control for the Gateway and health.
  • Voice Wake + push-to-talk overlay.
  • WebChat + debug tools.
  • Remote gateway control over SSH.

Note: signed builds are required for macOS permissions to stick across rebuilds (see macOS Permissions).

iOS node (optional)

  • Pairs as a node over the Gateway WebSocket (device pairing).
  • Voice trigger forwarding + Canvas surface.
  • Controlled via openclaw nodes ….

Runbook: iOS connect.

Android node (optional)

  • Pairs as a WS node via device pairing (openclaw devices ...).
  • Exposes Connect/Chat/Voice tabs plus Canvas, Camera, Screen capture, and Android device command families.
  • Runbook: Android connect.

Agent workspace + skills

  • Workspace root: ~/.openclaw/workspace (configurable via agents.defaults.workspace).
  • Injected prompt files: AGENTS.md, SOUL.md, TOOLS.md.
  • Skills: ~/.openclaw/workspace/skills/<skill>/SKILL.md.

Configuration

Minimal ~/.openclaw/openclaw.json (model + defaults):

{
  agent: {
    model: "<provider>/<model-id>",
  },
}

Full configuration reference (all keys + examples).

Security model (important)

  • Default: tools run on the host for the main session, so the agent has full access when it's just you.
  • Group/channel safety: set agents.defaults.sandbox.mode: "non-main" to run non-main sessions (groups/channels) inside per-session Docker sandboxes; bash then runs in Docker for those sessions.
  • Sandbox defaults: allowlist bash, process, read, write, edit, sessions_list, sessions_history, sessions_send, sessions_spawn; denylist browser, canvas, nodes, cron, discord, gateway.
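For example (key path as documented above):

{
  agents: {
    defaults: {
      sandbox: {
        mode: "non-main",
      },
    },
  },
}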

Details: Security guide · Docker + sandboxing · Sandbox config

WhatsApp

  • Link the device: pnpm openclaw channels login (stores creds in ~/.openclaw/credentials).
  • Allowlist who can talk to the assistant via channels.whatsapp.allowFrom.
  • If channels.whatsapp.groups is set, it becomes a group allowlist; include "*" to allow all.
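A minimal fragment, assuming the same openclaw.json shape as the Telegram/Discord examples below and a list-valued allowFrom:

{
  channels: {
    whatsapp: {
      allowFrom: ["+1234567890"],
    },
  },
}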

Telegram

  • Set TELEGRAM_BOT_TOKEN or channels.telegram.botToken (env wins).
  • Optional: set channels.telegram.groups (with channels.telegram.groups."*".requireMention); when set, it is a group allowlist (include "*" to allow all). Also channels.telegram.allowFrom or channels.telegram.webhookUrl + channels.telegram.webhookSecret as needed.
{
  channels: {
    telegram: {
      botToken: "123456:ABCDEF",
    },
  },
}

Slack

  • Set SLACK_BOT_TOKEN + SLACK_APP_TOKEN (or channels.slack.botToken + channels.slack.appToken).

Discord

  • Set DISCORD_BOT_TOKEN or channels.discord.token.
  • Optional: set commands.native, commands.text, or commands.useAccessGroups, plus channels.discord.allowFrom, channels.discord.guilds, or channels.discord.mediaMaxMb as needed.
{
  channels: {
    discord: {
      token: "1234abcd",
    },
  },
}

Signal

  • Requires signal-cli and a channels.signal config section.

BlueBubbles (iMessage)

  • Recommended iMessage integration.
  • Configure channels.bluebubbles.serverUrl + channels.bluebubbles.password and a webhook (channels.bluebubbles.webhookPath).
  • The BlueBubbles server runs on macOS; the Gateway can run on macOS or elsewhere.

iMessage (legacy)

  • Legacy macOS-only integration via imsg (Messages must be signed in).
  • If channels.imessage.groups is set, it becomes a group allowlist; include "*" to allow all.

Microsoft Teams

  • Configure a Teams app + Bot Framework, then add a msteams config section.
  • Allowlist who can talk via msteams.allowFrom; group access via msteams.groupAllowFrom or msteams.groupPolicy: "open".

WeChat

  • Official Tencent plugin via @tencent-weixin/openclaw-weixin (iLink Bot API). Private chats only; v2.x requires OpenClaw >=2026.3.22.
  • Install: openclaw plugins install "@tencent-weixin/openclaw-weixin", then openclaw channels login --channel openclaw-weixin to scan the QR code.
  • Requires the WeChat ClawBot plugin (WeChat > Me > Settings > Plugins); gradual rollout by Tencent.

WebChat

  • Uses the Gateway WebSocket; no separate WebChat port/config.

Browser control (optional):

{
  browser: {
    enabled: true,
    color: "#FF4500",
  },
}

Docs

Use these when you're past the onboarding flow and want the deeper reference.

Advanced docs (discovery + control)

Operations & troubleshooting

Deep dives

Workspace & skills

Platform internals

Email hooks (Gmail)

Molty

OpenClaw was built for Molty, a space lobster AI assistant. 🦞 by Peter Steinberger and the community.

Community

See CONTRIBUTING.md for guidelines, maintainers, and how to submit PRs. AI/vibe-coded PRs welcome! 🤖

Special thanks to Mario Zechner for his support and for pi-mono. Special thanks to Adam Doppelt for lobster.bot.

Thanks to all clawtributors:

steipete vincentkoc vignesh07 obviyus Mariano Belinky sebslight gumadeiras Takhoffman thewilloftheshadow cpojer tyler6204 joshp123 Glucksberg mcaxtr quotentiroler osolmaz Sid-Qin joshavant shakkernerd bmendonca3 mukhtharcm zerone0x mcinteerj ngutman lailoo arosstale rodrigouroz robbyczgw-cla Elonito Clawborn yinghaosang BunsDev christianklotz echoVic coygeek roshanasingh4 mneves75 joaohlisboa bohdanpodvirnyi nachx639 onutc Verite Igiraneza widingmarcus-cyber akramcodez aether-ai-agent bjesuiter MaudeBot YuriNachos chilu18 byungsker dbhurley JayMishra-source iHildy mudrii dlauer Solvely-Colin czekaj advaitpaliwal lc0rp grp06 HenryLoenwind azade-c Lukavyi vrknetha brandonwise conroywhitney Tobias Bischoff davidrudduck xinhuagu jaydenfyi petter-b heyhudson MatthieuBizien huntharo omair445 adam91holt adhitShet smartprogrammer93 radek-paclt frankekn bradleypriest rahthakor shadril238 VACInc juanpablodlc jonisjongithub magimetal stakeswky abhisekbasu1 MisterGuy420 hsrvc nabbilkhan aldoeliacim jamesgroat orlyjamie Elarwei001 rubyrunsstuff Phineas1500 meaningfool sfo2001 Marvae liuy shtse8 thebenignhacker carrotRakko ranausmanai kevinWangSheng gregmousseau rrenamed akoscz jarvis-medmatic danielz1z pandego xadenryan NicholasSpisak graysurf gupsammy nyanjou sibbl gejifeng ide-rea leszekszpunar Yida-Dev AI-Reviewer-QS SocialNerd42069 maxsumrall hougangdev Minidoracat AnonO6 sreekaransrinath YuzuruS riccardogiorato Bridgerz Mrseenz buddyh Eng. Juan Combetto peschee cash-echo-bot jalehman zknicker Harald Buerbaumer taw0002 scald openperf BUGKillerKing Oceanswave Hiren Patel kiranjd antons dan-dr jadilson12 sumleo Whoaa512 luijoc niceysam JustYannicc emanuelst TsekaLuk JustasM loiie45e davidguttman natefikru dougvk koala73 mkbehr zats Simone Macario openclaw-bot ENCHIGO mteam88 Blakeshannon gabriel-trigo neist pejmanjohn durenzidu Ryan Haines hcl XuHao benithors bitfoundry-ai HeMuling markmusson ameno- battman21 BinHPdev dguido evalexpr guirguispierre henrino3 joeykrug loganprit odysseus0 dbachelder Divanoli Mydeen Pitchai liuxiaopai-ai Sam Padilla pvtclawn seheepeak TSavo nachoiacovino misterdas LeftX badlogic Shuai-DaiDai mousberg Masataka Shinohara BillChirico Lewis solstead julianengel dantelex sahilsatralkar kkarimi mahmoudashraf93 pkrmf ryan-crabbe miloudbelarebia Mars El-Fitz McRolly NWANGWU carlulsoe Dithilli emonty fal3 mitschabaude-bot benostein LI SHANXIN magendary mahanandhi CashWilliams j2h4u bsormagec Jessy LANGE Lalit Singh hyf0-agent andranik-sahakyan unisone jeann2013 jogelin rmorse scz2011 wes-davis popomore cathrynlavery iamadig Vasanth Rao Naik Sabavat Jay Caldwell Shailesh Kirill Shchetynin ruypang mitchmcalister Paul van Oorschot Xu Gu Menglin Li artuskg jackheuberger imfing superman32432432 Syhids Marvin Taylor Asplund dakshaymehta Stefan Galescu lploc94 WalterSumbon krizpoon EnzeD Evizero Grynn hydro13 jverdi kentaro kunalk16 longmaba mjrussell optimikelabs oswalpalash RamiNoodle733 sauerdaniel SleuthCo TaKO8Ki travisp rodbland2021 fagemx BigUncle Igor Markelov zhoulc777 connorshea TIHU Tony Dehnke pablohrcarvalho bonald rhuanssauro Tanwa Arpornthip webvijayi Tom Ron ozbillwang Patrick Barletta Ian Derrington austinm911 Ayush10 boris721 damoahdominic doodlewind ikari-pl philipp-spiess shayan919293 Harrington-bot nonggia.liang Michael Lee OscarMinjarez claude Alg0rix Lucky Harry Cui Kepler h0tp-ftw Youyou972 Dominic danielwanwx 0xJonHoldsCrypto akyourowngames clawdinator[bot] erikpr1994 thesash thesomewhatyou dashed Dale Babiy Diaspar4u brianleach codexGW dirbalak Iranb Max 
TideFinder Chase Dorsey Joly0 adityashaw2 tumf slonce70 alexgleason theonejvo Skyler Miao Jeremiah Lowin peetzweg/ chrisrodz ghsmc ibrahimq21 irtiq7 Jonathan D. Rhyne (DJ-D) kelvinCB mitsuhiko rybnikov santiagomed suminhthanh svkozak kaizen403 sleontenko Nate CornBrother0x DukeDeSouth crimeacs Cklee Garnet Liu neverland ryan sircrumpet AdeboyeDN Neo asklee-klawd benediktjohannes 张哲芳 constansino Yuting Lin OfflynAI Rajat Joshi Daniel Zou Manik Vahsith ProspectOre Lilo 24601 awkoy dawondyifraw google-labs-jules[bot] hyojin Kansodata natedenh pi0 dddabtc AkashKobal wu-tian807 Ganghyun Kim Stephen Brian King tosh-hamburg John Rood JINNYEONG KIM Dinakar Sarbada aj47 Protocol Zero Limitless Mykyta Bozhenko Nicholas Shivam Kumar Raut andreesg Fred White Anandesh-Sharma ysqander ezhikkk andreabadesso BinaryMuse cordx56 DevSecTim edincampara fcatuhe gildo itsjaydesu ivanrvpereira loeclos MarvinCui p6l-richard thejhinvirtuoso yudshj Wangnov Jonathan Works Yassine Amjad Django Navarro Frank Harris Kenny Lee Drake Thomsen wangai-studio AytuncYildizli Charlie Niño Jeremy Mumford Yeom-JinHo Rob Axelsen junwon Pratham Dubey amitbiswal007 Slats Oren Parker Todd Brooks MattQ Milofax Steve (OpenClaw) Matthew Cassius0924 0xbrak 8BlT Abdul535 abhaymundhara aduk059 afurm aisling404 akari-musubi albertlieyingadrian Alex-Alaniz ali-aljufairi altaywtf araa47 Asleep123 avacadobanana352 barronlroth bennewton999 bguidolim bigwest60 caelum0x championswimmer dutifulbob eternauta1337 foeken gittb HeimdallStrategy junsuwhy knocte MackDing nobrainer-tech Noctivoro Raikan10 Swader Alexis Gallagher alexstyl Ethan Palm yingchunbai joshrad-dev Dan Ballance Eric Su Kimitaka Watanabe Justin Ling lutr0 Raymond Berger atalovesyou jayhickey jonasjancarik latitudeki5223 minghinmatthewlam rafaelreis-r ratulsarna timkrase efe-buken manmal easternbloc manuelhettich sktbrd larlyssa Mind-Dragon pcty-nextgen-service-account tmchow uli-will-code Marc Gratch JackyWay aaronveklabs CJWTRUST erik-agens odnxe T5-AndyML Josh Phillips mujiannan Marco Di Dionisio Randy Torres afern247 0oAstro alexanderatallah testingabc321 humanwritten aaronn Alphonse-arianee gtsifrikas hrdwdmrbl hugobarauna jiulingyun kitze loukotal MSch odrobnik reeltimeapps rhjoh ronak-guliani snopoke
