| summary | read_when | title | sidebarTitle |
|---|---|---|---|
| Place outbound and accept inbound voice calls via Twilio, Telnyx, or Plivo, with optional realtime voice and streaming transcription | | Voice call plugin | Voice call |
Voice calls for OpenClaw via a plugin. Supports outbound notifications, multi-turn conversations, full-duplex realtime voice, streaming transcription, and inbound calls with allowlist policies.
Current providers: `twilio` (Programmable Voice + Media Streams),
`telnyx` (Call Control v2), `plivo` (Voice API + XML transfer + GetInput
speech), `mock` (dev/no network).
Quick start
Install from npm:

```bash
openclaw plugins install @openclaw/voice-call
```

Install from a local folder:

```bash
PLUGIN_SRC=./path/to/local/voice-call-plugin
openclaw plugins install "$PLUGIN_SRC"
cd "$PLUGIN_SRC" && pnpm install
```

Restart the Gateway afterwards so the plugin loads.
Set config under `plugins.entries.voice-call.config` (see
[Configuration](#configuration) below for the full shape). At minimum:
`provider`, provider credentials, `fromNumber`, and a publicly
reachable webhook URL.
```bash
openclaw voicecall setup
```
The default output is readable in chat logs and terminals. It checks
plugin enablement, provider credentials, webhook exposure, and that
only one audio mode (`streaming` or `realtime`) is active. Use
`--json` for scripts.
```bash
openclaw voicecall smoke
openclaw voicecall smoke --to "+15555550123"
```
Both are dry runs by default. Add `--yes` to actually place a short
outbound notify call:
```bash
openclaw voicecall smoke --to "+15555550123" --yes
```
For Twilio, Telnyx, and Plivo, setup must resolve to a **public webhook URL**.
If `publicUrl`, the tunnel URL, the Tailscale URL, or the serve fallback
resolves to loopback or private network space, setup fails instead of
starting a provider that cannot receive carrier webhooks.
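
To see what "loopback or private network space" means in practice, here is a minimal Python sketch of such a check. It is not the plugin's implementation: the function name `is_publicly_routable` is hypothetical, and the sketch does not resolve DNS names, so a hostname that points at a private address would still pass.

```python
import ipaddress
from urllib.parse import urlparse


def is_publicly_routable(url: str) -> bool:
    """Return False when the URL host is loopback or private address space.

    Hypothetical helper for illustration only. DNS names are accepted
    without being resolved.
    """
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: treat it as a DNS name (unresolved in this sketch).
        return True
    # is_global is False for loopback, RFC 1918, link-local, and CGNAT ranges.
    return addr.is_global
```

A carrier can reach `https://example.ngrok.app/voice/webhook`, but not `http://127.0.0.1:3334/voice/webhook` or an RFC 1918 address, which is why setup rejects the latter.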
Configuration
If `enabled: true` but the selected provider is missing credentials,
Gateway startup logs a setup-incomplete warning with the missing keys and
skips starting the runtime. Commands, RPC calls, and agent tools still
return the exact missing provider configuration when used.
```json5
{
  plugins: {
    entries: {
      "voice-call": {
        enabled: true,
        config: {
          provider: "twilio", // or "telnyx" | "plivo" | "mock"
          fromNumber: "+15550001234", // or TWILIO_FROM_NUMBER for Twilio
          toNumber: "+15550005678",

          twilio: {
            accountSid: "ACxxxxxxxx",
            authToken: "...",
          },
          telnyx: {
            apiKey: "...",
            connectionId: "...",
            // Telnyx webhook public key from the Mission Control Portal
            // (Base64; can also be set via TELNYX_PUBLIC_KEY).
            publicKey: "...",
          },
          plivo: {
            authId: "MAxxxxxxxxxxxxxxxxxxxx",
            authToken: "...",
          },

          // Webhook server
          serve: {
            port: 3334,
            path: "/voice/webhook",
          },

          // Webhook security (recommended for tunnels/proxies)
          webhookSecurity: {
            allowedHosts: ["voice.example.com"],
            trustedProxyIPs: ["100.64.0.1"],
          },

          // Public exposure (pick one)
          // publicUrl: "https://example.ngrok.app/voice/webhook",
          // tunnel: { provider: "ngrok" },
          // tailscale: { mode: "funnel", path: "/voice/webhook" },

          outbound: {
            defaultMode: "notify", // notify | conversation
          },

          streaming: { enabled: true /* see Streaming transcription */ },
          realtime: { enabled: false /* see Realtime voice */ },
        },
      },
    },
  },
}
```
Auto-migrated streaming keys:
- `streaming.sttProvider` → `streaming.provider`
- `streaming.openaiApiKey` → `streaming.providers.openai.apiKey`
- `streaming.sttModel` → `streaming.providers.openai.model`
- `streaming.silenceDurationMs` → `streaming.providers.openai.silenceDurationMs`
- `streaming.vadThreshold` → `streaming.providers.openai.vadThreshold`
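
The key mapping above can be expressed as a small transformation. The following Python sketch is an illustration only, not the plugin's migration code: `migrate_streaming` is a hypothetical name, and the rule that an already-present new-style key wins over its legacy counterpart is an assumption.

```python
import copy

# Legacy key -> path in the current config shape (taken from the list above).
LEGACY_STREAMING_KEYS = {
    "sttProvider": ("provider",),
    "openaiApiKey": ("providers", "openai", "apiKey"),
    "sttModel": ("providers", "openai", "model"),
    "silenceDurationMs": ("providers", "openai", "silenceDurationMs"),
    "vadThreshold": ("providers", "openai", "vadThreshold"),
}


def migrate_streaming(streaming: dict) -> dict:
    """Return a copy of the `streaming` config with legacy keys relocated."""
    out = copy.deepcopy(streaming)
    for old_key, new_path in LEGACY_STREAMING_KEYS.items():
        if old_key not in out:
            continue
        value = out.pop(old_key)
        node = out
        for part in new_path[:-1]:
            node = node.setdefault(part, {})
        # Assumption: an explicit new-style value is kept if both are present.
        node.setdefault(new_path[-1], value)
    return out
```

For example, `{"sttProvider": "openai", "sttModel": "gpt-4o-transcribe"}` becomes `{"provider": "openai", "providers": {"openai": {"model": "gpt-4o-transcribe"}}}`.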
Realtime voice conversations
`realtime` selects a full-duplex realtime voice provider for live call
audio. It is separate from `streaming`, which only forwards audio to
realtime transcription providers.

Current runtime behavior:

- `realtime.enabled` is supported for Twilio Media Streams.
- `realtime.provider` is optional. If unset, Voice Call uses the first registered realtime voice provider.
- Bundled realtime voice providers: Google Gemini Live (`google`) and OpenAI (`openai`), registered by their provider plugins.
- Provider-owned raw config lives under `realtime.providers.<providerId>`.
- Voice Call exposes the shared `openclaw_agent_consult` realtime tool by default. The realtime model can call it when the caller asks for deeper reasoning, current information, or normal OpenClaw tools.
- If `realtime.provider` points at an unregistered provider, or no realtime voice provider is registered at all, Voice Call logs a warning and skips realtime media instead of failing the whole plugin.
- Consult session keys reuse the existing voice session when available, then fall back to the caller/callee phone number so follow-up consult calls keep context during the call.
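
The provider selection rules above (optional `realtime.provider`, first-registered fallback, warn-and-skip on an unknown id) can be sketched as follows. This is a hedged illustration in Python, not the plugin's code; `resolve_provider` and the logger name are hypothetical.

```python
import logging
from typing import Optional, Sequence

log = logging.getLogger("voice-call-sketch")  # hypothetical logger name


def resolve_provider(configured: Optional[str], registered: Sequence[str]) -> Optional[str]:
    """Pick a provider id, or return None to skip realtime media entirely."""
    if not registered:
        log.warning("no realtime voice provider registered; skipping realtime media")
        return None
    if configured is None:
        # Unset: use the first registered provider.
        return registered[0]
    if configured not in registered:
        log.warning("provider %r is not registered; skipping realtime media", configured)
        return None
    return configured
```

Returning `None` instead of raising mirrors the documented behavior: a misconfigured provider disables realtime media but leaves the rest of the plugin running.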
Tool policy
`realtime.toolPolicy` controls the consult run:

| Policy | Behavior |
|---|---|
| `safe-read-only` | Expose the consult tool and limit the regular agent to `read`, `web_search`, `web_fetch`, `x_search`, `memory_search`, and `memory_get`. |
| `owner` | Expose the consult tool and let the regular agent use the normal agent tool policy. |
| `none` | Do not expose the consult tool. Custom `realtime.tools` are still passed through to the realtime provider. |
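
Read as data, the three policies reduce to two decisions: whether the consult tool is exposed, and which tools the regular agent may use. The Python sketch below encodes the table for illustration; `ConsultPlan` and `plan_for` are hypothetical names, and `None` stands for "no restriction from this policy".

```python
from typing import NamedTuple, Optional, Tuple

# Tool names copied from the safe-read-only row of the table above.
SAFE_READ_ONLY_TOOLS = (
    "read", "web_search", "web_fetch", "x_search", "memory_search", "memory_get",
)


class ConsultPlan(NamedTuple):
    expose_consult_tool: bool
    agent_tools: Optional[Tuple[str, ...]]  # None = not restricted by this policy


def plan_for(policy: str) -> ConsultPlan:
    """Map a realtime.toolPolicy value to the behavior described in the table."""
    if policy == "safe-read-only":
        return ConsultPlan(True, SAFE_READ_ONLY_TOOLS)
    if policy == "owner":
        return ConsultPlan(True, None)
    if policy == "none":
        return ConsultPlan(False, None)
    raise ValueError(f"unknown toolPolicy: {policy!r}")
```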
Realtime provider examples
Google Gemini Live:

Defaults: API key from `realtime.providers.google.apiKey`, `GEMINI_API_KEY`, or `GOOGLE_GENERATIVE_AI_API_KEY`; model `gemini-2.5-flash-native-audio-preview-12-2025`; voice `Kore`.

```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          provider: "twilio",
          inboundPolicy: "allowlist",
          allowFrom: ["+15550005678"],
          realtime: {
            enabled: true,
            provider: "google",
            instructions: "Speak briefly. Call openclaw_agent_consult before using deeper tools.",
            toolPolicy: "safe-read-only",
            providers: {
              google: {
                apiKey: "${GEMINI_API_KEY}",
                model: "gemini-2.5-flash-native-audio-preview-12-2025",
                voice: "Kore",
              },
            },
          },
        },
      },
    },
  },
}
```

OpenAI:

```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          realtime: {
            enabled: true,
            provider: "openai",
            providers: {
              openai: { apiKey: "${OPENAI_API_KEY}" },
            },
          },
        },
      },
    },
  },
}
```
See Google provider and OpenAI provider for provider-specific realtime voice options.
Streaming transcription
`streaming` selects a realtime transcription provider for live call audio.

Current runtime behavior:

- `streaming.provider` is optional. If unset, Voice Call uses the first registered realtime transcription provider.
- Bundled realtime transcription providers: Deepgram (`deepgram`), ElevenLabs (`elevenlabs`), Mistral (`mistral`), OpenAI (`openai`), and xAI (`xai`), registered by their provider plugins.
- Provider-owned raw config lives under `streaming.providers.<providerId>`.
- If `streaming.provider` points at an unregistered provider, or none is registered, Voice Call logs a warning and skips media streaming instead of failing the whole plugin.
Streaming provider examples
Defaults: API key `streaming.providers.openai.apiKey` or `OPENAI_API_KEY`; model `gpt-4o-transcribe`; `silenceDurationMs: 800`; `vadThreshold: 0.5`.

```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          streaming: {
            enabled: true,
            provider: "openai",
            streamPath: "/voice/stream",
            providers: {
              openai: {
                apiKey: "sk-...", // optional if OPENAI_API_KEY is set
                model: "gpt-4o-transcribe",
                silenceDurationMs: 800,
                vadThreshold: 0.5,
              },
            },
          },
        },
      },
    },
  },
}
```
Defaults: API key `streaming.providers.xai.apiKey` or `XAI_API_KEY`;
endpoint `wss://api.x.ai/v1/stt`; encoding `mulaw`; sample rate `8000`;
`endpointingMs: 800`; `interimResults: true`.
```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          streaming: {
            enabled: true,
            provider: "xai",
            streamPath: "/voice/stream",
            providers: {
              xai: {
                apiKey: "${XAI_API_KEY}", // optional if XAI_API_KEY is set
                endpointingMs: 800,
                language: "en",
              },
            },
          },
        },
      },
    },
  },
}
```
TTS for calls
Voice Call uses the core `messages.tts` configuration for streaming
speech on calls. You can override it under the plugin config with the
same shape; the override deep-merges with `messages.tts`.
```json5
{
  tts: {
    provider: "elevenlabs",
    providers: {
      elevenlabs: {
        voiceId: "pMsXgVXv3BLzUgSXRplE",
        modelId: "eleven_multilingual_v2",
      },
    },
  },
}
```
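
To illustrate what "deep-merges" means for the effective call TTS config, here is a minimal Python sketch. It assumes the usual deep-merge semantics (nested objects merge key by key, scalar values from the override replace the base); the plugin's exact rules, for example for arrays, may differ, and `deep_merge` is a hypothetical name.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` is merged recursively onto `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Core messages.tts and a plugin-level override that only changes the voice/model.
core_tts = {"provider": "openai", "providers": {"openai": {"voice": "alloy"}}}
plugin_tts = {"providers": {"openai": {"model": "gpt-4o-mini-tts", "voice": "marin"}}}
effective = deep_merge(core_tts, plugin_tts)
```

Under these assumptions the override keeps `provider: "openai"` from the core config and only replaces the nested `voice` and adds `model`, so a plugin-level block does not need to repeat the whole core configuration.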
Behavior notes:
- Legacy `tts.<provider>` keys inside plugin config (`openai`, `elevenlabs`, `microsoft`, `edge`) are repaired by `openclaw doctor --fix`; committed config should use `tts.providers.<provider>`.
- Core TTS is used when Twilio media streaming is enabled; otherwise calls fall back to provider-native voices.
- If a Twilio media stream is already active, Voice Call does not fall back to TwiML `<Say>`. If telephony TTS is unavailable in that state, the playback request fails instead of mixing two playback paths.
- When telephony TTS falls back to a secondary provider, Voice Call logs a warning with the provider chain (`from`, `to`, `attempts`) for debugging.
- When Twilio barge-in or stream teardown clears the pending TTS queue, queued playback requests settle instead of hanging callers awaiting playback completion.
TTS examples
Core TTS:

```json5
{
  messages: {
    tts: {
      provider: "openai",
      providers: {
        openai: { voice: "alloy" },
      },
    },
  },
}
```

ElevenLabs override:

```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          tts: {
            provider: "elevenlabs",
            providers: {
              elevenlabs: {
                apiKey: "elevenlabs_key",
                voiceId: "pMsXgVXv3BLzUgSXRplE",
                modelId: "eleven_multilingual_v2",
              },
            },
          },
        },
      },
    },
  },
}
```

OpenAI override:

```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          tts: {
            providers: {
              openai: {
                model: "gpt-4o-mini-tts",
                voice: "marin",
              },
            },
          },
        },
      },
    },
  },
}
```

Inbound calls
Inbound policy defaults to disabled. To enable inbound calls, set:
```json5
{
  inboundPolicy: "allowlist",
  allowFrom: ["+15550001234"],
  inboundGreeting: "Hello! How can I help?",
}
```

Auto-responses use the agent system. Tune with `responseModel`,
`responseSystemPrompt`, and `responseTimeoutMs`.
Spoken output contract
For auto-responses, Voice Call appends a strict spoken-output contract to the system prompt:
```json
{"spoken":"..."}
```

Voice Call extracts speech text defensively:

- Ignores payloads marked as reasoning/error content.
- Parses direct JSON, fenced JSON, or inline `"spoken"` keys.
- Falls back to plain text and removes likely planning/meta lead-in paragraphs.
This keeps spoken playback focused on caller-facing text and avoids leaking planning text into audio.
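
The parsing order described above (direct JSON, then fenced JSON, then an inline `"spoken"` key, then plain text) might look roughly like the Python sketch below. It is an approximation for illustration, not the plugin's parser: the function names are hypothetical, and the sketch neither filters reasoning/error payloads nor strips planning lead-ins from the plain-text fallback.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block tidy
FENCED_JSON_RE = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)
INLINE_SPOKEN_RE = re.compile(r'"spoken"\s*:\s*("(?:[^"\\]|\\.)*")')


def _spoken_from_json(text: str) -> Optional[str]:
    """Return the "spoken" string if `text` is a JSON object that has one."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and isinstance(data.get("spoken"), str):
        return data["spoken"]
    return None


def extract_spoken(reply: str) -> str:
    """Best-effort extraction of caller-facing text from a model reply."""
    reply = reply.strip()
    direct = _spoken_from_json(reply)
    if direct is not None:
        return direct
    fenced = FENCED_JSON_RE.search(reply)
    if fenced:
        value = _spoken_from_json(fenced.group(1))
        if value is not None:
            return value
    inline = INLINE_SPOKEN_RE.search(reply)
    if inline:
        try:
            return json.loads(inline.group(1))
        except json.JSONDecodeError:
            pass
    return reply  # plain-text fallback (lead-in removal not implemented here)
```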
Conversation startup behavior
For outbound conversation calls, first-message handling is tied to live
playback state:
- Barge-in queue clear and auto-response are suppressed only while the initial greeting is actively speaking.
- If initial playback fails, the call returns to `listening` and the initial message remains queued for retry.
- Initial playback for Twilio streaming starts on stream connect without extra delay.
- Barge-in aborts active playback and clears queued-but-not-yet-playing Twilio TTS entries. Cleared entries resolve as skipped, so follow-up response logic can continue without waiting on audio that will never play.
- Realtime voice conversations use the realtime stream's own opening turn. Voice Call does not post a legacy `<Say>` TwiML update for that initial message, so outbound `<Connect><Stream>` sessions stay attached.
Twilio stream disconnect grace
When a Twilio media stream disconnects, Voice Call waits 2000 ms before auto-ending the call:
- If the stream reconnects during that window, auto-end is canceled.
- If no stream re-registers after the grace period, the call is ended to prevent stuck active calls.
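
The grace window can be modelled as a deadline that a reconnect cancels. The Python sketch below uses explicit timestamps so it stays deterministic; a real runtime would more likely use a timer. `DisconnectGrace` and its methods are hypothetical names, and only the 2000 ms value comes from the text above.

```python
from typing import Optional


class DisconnectGrace:
    """Track whether a call should be auto-ended after a stream disconnect."""

    def __init__(self, grace_ms: int = 2000) -> None:
        self.grace_ms = grace_ms
        self._deadline_ms: Optional[int] = None

    def on_disconnect(self, now_ms: int) -> None:
        # Start (or restart) the grace window.
        self._deadline_ms = now_ms + self.grace_ms

    def on_reconnect(self) -> None:
        # A stream re-registered in time: cancel the pending auto-end.
        self._deadline_ms = None

    def should_end_call(self, now_ms: int) -> bool:
        return self._deadline_ms is not None and now_ms >= self._deadline_ms
```

A disconnect at t=1000 ms followed by a reconnect at t=2500 ms never ends the call, while a disconnect at t=5000 ms with no reconnect ends it from t=7000 ms onward.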
Stale call reaper
Use `staleCallReaperSeconds` to end calls that never receive a terminal
webhook (for example, notify-mode calls that never complete). The default
is `0` (disabled).

Recommended ranges:

- Production: `120`–`300` seconds for notify-style flows.
- Keep this value higher than `maxDurationSeconds` so normal calls can finish. A good starting point is `maxDurationSeconds + 30–60` seconds.
```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          maxDurationSeconds: 300,
          staleCallReaperSeconds: 360,
        },
      },
    },
  },
}
```
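
The sizing rule is simple arithmetic: with `maxDurationSeconds: 300` and a 60-second margin, the reaper value is 300 + 60 = 360, as in the example. The Python sketch below captures the rule and a sanity check; both helper names are hypothetical and not part of the plugin.

```python
def suggested_reaper_seconds(max_duration_seconds: int, margin_seconds: int = 60) -> int:
    """Apply the 'maxDurationSeconds + 30-60 seconds' starting point."""
    if not 30 <= margin_seconds <= 60:
        raise ValueError("margin_seconds should be between 30 and 60")
    return max_duration_seconds + margin_seconds


def reaper_config_ok(max_duration_seconds: int, stale_call_reaper_seconds: int) -> bool:
    """0 disables the reaper; otherwise it must exceed the max call duration."""
    if stale_call_reaper_seconds == 0:
        return True
    return stale_call_reaper_seconds > max_duration_seconds
```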
Webhook security
When a proxy or tunnel sits in front of the Gateway, the plugin reconstructs the public URL for signature verification. These options control which forwarded headers are trusted:
- Allowlist hosts from forwarding headers.
- Trust forwarded headers without an allowlist.
- Only trust forwarded headers when the request remote IP matches the list.

Additional protections:

- Webhook replay protection is enabled for Twilio and Plivo. Replayed valid webhook requests are acknowledged but skipped for side effects.
- Twilio conversation turns include a per-turn token in `<Gather>` callbacks, so stale/replayed speech callbacks cannot satisfy a newer pending transcript turn.
- Unauthenticated webhook requests are rejected before body reads when the provider's required signature headers are missing.
- The voice-call webhook uses the shared pre-auth body profile (64 KB / 5 seconds) plus a per-IP in-flight cap before signature verification.
Example with a stable public host:
```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          publicUrl: "https://voice.example.com/voice/webhook",
          webhookSecurity: {
            allowedHosts: ["voice.example.com"],
          },
        },
      },
    },
  },
}
```
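
Why the reconstructed public URL matters: Twilio's published request-validation scheme signs the full request URL together with the sorted POST parameters (HMAC-SHA1 keyed with the auth token, Base64-encoded, sent in `X-Twilio-Signature`). The Python sketch below follows that scheme for form-encoded requests to show that a signature computed for the public URL does not verify against the local URL the Gateway actually listens on. It is an illustration, not the plugin's verifier, and the function names are hypothetical.

```python
import base64
import hashlib
import hmac


def twilio_signature(auth_token: str, url: str, params: dict) -> str:
    """Compute a Twilio-style signature for a form-encoded webhook request."""
    # Full URL followed by each POST parameter name and value, sorted by name.
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid(auth_token: str, url: str, params: dict, header_signature: str) -> bool:
    """Constant-time comparison against the signature header value."""
    expected = twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_signature)
```

If the verifier used `http://127.0.0.1:3334/voice/webhook` instead of `https://voice.example.com/voice/webhook`, every genuine request would fail validation, which is why the host taken from forwarded headers has to be both correct and trustworthy.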
CLI
```bash
openclaw voicecall call --to "+15555550123" --message "Hello from OpenClaw"
openclaw voicecall start --to "+15555550123"   # alias for call
openclaw voicecall continue --call-id <id> --message "Any questions?"
openclaw voicecall speak --call-id <id> --message "One moment"
openclaw voicecall dtmf --call-id <id> --digits "ww123456#"
openclaw voicecall end --call-id <id>
openclaw voicecall status --call-id <id>
openclaw voicecall tail
openclaw voicecall latency                     # summarize turn latency from logs
openclaw voicecall expose --mode funnel
```

`latency` reads `calls.jsonl` from the default voice-call storage path.
Use `--file <path>` to point at a different log and `--last <n>` to limit
analysis to the last N records (default 200). Output includes p50/p90/p99
for turn latency and listen-wait times.
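
For readers who want to reproduce a p50/p90/p99 summary by hand, here is a minimal Python sketch using the nearest-rank percentile method. The record field name `turnLatencyMs` is a placeholder assumption (the actual `calls.jsonl` schema is not described here), and the CLI may use a different percentile definition.

```python
import json
from typing import Dict, Iterable, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def summarize(lines: Iterable[str], field: str = "turnLatencyMs", last: int = 200) -> Dict[str, float]:
    """Summarize one numeric field over the last `last` JSONL records."""
    records = [json.loads(line) for line in lines if line.strip()]
    values = [rec[field] for rec in records[-last:] if field in rec]
    if not values:
        return {}
    return {f"p{p}": percentile(values, p) for p in (50, 90, 99)}
```

Passing an open file handle works as well as a list of strings, for example `summarize(open(path, encoding="utf-8"))` with `path` pointing at a log file.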
Agent tool
Tool name: `voice_call`.

| Action | Args |
|---|---|
| `initiate_call` | `message`, `to?`, `mode?` |
| `continue_call` | `callId`, `message` |
| `speak_to_user` | `callId`, `message` |
| `send_dtmf` | `callId`, `digits` |
| `end_call` | `callId` |
| `get_status` | `callId` |

This repo ships a matching skill doc at `skills/voice-call/SKILL.md`.
Gateway RPC
| Method | Args |
|---|---|
| `voicecall.initiate` | `to?`, `message`, `mode?` |
| `voicecall.continue` | `callId`, `message` |
| `voicecall.speak` | `callId`, `message` |
| `voicecall.dtmf` | `callId`, `digits` |
| `voicecall.end` | `callId` |
| `voicecall.status` | `callId` |