openclaw/docs/tools/tts.md
Vincent Koc fbd6b3ce3c docs(tts): A-Z order providers and add tools/tts to Tools nav group
- docs/tools/tts.md: alphabetize providers in three places that listed
  them: the supported-providers table (Azure Speech ... Xiaomi MiMo),
  the configuration Tabs (12 provider presets in A-Z), and the field
  reference AccordionGroup. Top-level fields stay first; provider
  tabs/accordions follow strict alphabetical order. Wording, schema,
  and defaults unchanged.
- docs/docs.json: add tools/tts to the main Tools sidebar group
  (slotted between trajectory and video-generation, matching the
  alphabetical neighborhood with image-generation, music-generation,
  video-generation). Previously tts only appeared under
  Nodes > Media capabilities, which was a discoverability gap for
  readers looking for TTS alongside the other generation tools.
2026-04-25 22:05:46 -07:00


---
summary: "Text-to-speech for outbound replies: providers, personas, slash commands, and per-channel output"
read_when:
  - Enabling text-to-speech for replies
  - Configuring a TTS provider, fallback chain, or persona
  - Using /tts commands or directives
title: "Text-to-speech"
---

OpenClaw can convert outbound replies into audio across 13 speech providers. It delivers native voice messages on Feishu, Matrix, Telegram, and WhatsApp; audio attachments on every other channel; and PCM/Ulaw streams for telephony and Talk.

Quick start

  1. Pick a provider. OpenAI and ElevenLabs are the most reliable hosted options. Microsoft and Local CLI work without an API key. See the [provider matrix](#supported-providers) for the full list.
  2. Export the env var for your provider (for example `OPENAI_API_KEY` or `ELEVENLABS_API_KEY`).
  3. Set `messages.tts.auto: "always"` and `messages.tts.provider`:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
    },
  },
}
```

`/tts status` shows the current state. `/tts audio Hello from OpenClaw` sends a one-off audio reply.

Auto-TTS is **off** by default. When `messages.tts.provider` is unset, OpenClaw picks the first configured provider in registry auto-select order.

Supported providers

| Provider | Auth | Notes |
| --- | --- | --- |
| Azure Speech | `AZURE_SPEECH_KEY` + `AZURE_SPEECH_REGION` (also `AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, `SPEECH_REGION`) | Native Ogg/Opus voice-note output and telephony. |
| ElevenLabs | `ELEVENLABS_API_KEY` or `XI_API_KEY` | Voice cloning, multilingual, deterministic via `seed`. |
| Google Gemini | `GEMINI_API_KEY` or `GOOGLE_API_KEY` | Gemini API TTS; persona-aware via `promptTemplate: "audio-profile-v1"`. |
| Gradium | `GRADIUM_API_KEY` | Voice-note and telephony output. |
| Inworld | `INWORLD_API_KEY` | Streaming TTS API. Native Opus voice-note and PCM telephony. |
| Local CLI | none | Runs a configured local TTS command. |
| Microsoft | none | Public Edge neural TTS via `node-edge-tts`. Best-effort, no SLA. |
| MiniMax | `MINIMAX_API_KEY` (or Token Plan: `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, `MINIMAX_CODING_API_KEY`) | T2A v2 API. Defaults to `speech-2.8-hd`. |
| OpenAI | `OPENAI_API_KEY` | Also used for auto-summary; supports persona `instructions`. |
| OpenRouter | `OPENROUTER_API_KEY` (can reuse `models.providers.openrouter.apiKey`) | Default model `hexgrad/kokoro-82m`. |
| Volcengine | `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY` (legacy AppID/token: `VOLCENGINE_TTS_APPID`/`_TOKEN`) | BytePlus Seed Speech HTTP API. |
| Vydra | `VYDRA_API_KEY` | Shared image, video, and speech provider. |
| xAI | `XAI_API_KEY` | xAI batch TTS. Native Opus voice-note is not supported. |
| Xiaomi MiMo | `XIAOMI_API_KEY` | MiMo TTS through Xiaomi chat completions. |

If multiple providers are configured, the selected one is used first and the others are fallback options. Auto-summary uses summaryModel (or agents.defaults.model.primary), so that provider must also be authenticated if you keep summaries enabled.

The bundled **Microsoft** provider uses Microsoft Edge's online neural TTS service via `node-edge-tts`. It is a public web service without a published SLA or quota — treat it as best-effort. The legacy provider id `edge` is normalized to `microsoft` and `openclaw doctor --fix` rewrites persisted config; new configs should always use `microsoft`.

Configuration

TTS config lives under messages.tts in ~/.openclaw/openclaw.json. Pick a preset and adapt the provider block:

**Azure Speech**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "azure-speech",
      providers: {
        "azure-speech": {
          apiKey: "${AZURE_SPEECH_KEY}",
          region: "eastus",
          voice: "en-US-JennyNeural",
          lang: "en-US",
          outputFormat: "audio-24khz-48kbitrate-mono-mp3",
          voiceNoteOutputFormat: "ogg-24khz-16bit-mono-opus",
        },
      },
    },
  },
}
```

**ElevenLabs**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
      providers: {
        elevenlabs: {
          apiKey: "${ELEVENLABS_API_KEY}",
          model: "eleven_multilingual_v2",
          voiceId: "EXAVITQu4vr4xnSDxMaL",
        },
      },
    },
  },
}
```

**Google Gemini**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "google",
      providers: {
        google: {
          apiKey: "${GEMINI_API_KEY}",
          model: "gemini-3.1-flash-tts-preview",
          voiceName: "Kore",
          // Optional natural-language style prompts:
          // audioProfile: "Speak in a calm, podcast-host tone.",
          // speakerName: "Alex",
        },
      },
    },
  },
}
```

**Gradium**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "gradium",
      providers: {
        gradium: {
          apiKey: "${GRADIUM_API_KEY}",
          voiceId: "YTpq7expH9539ERJ",
        },
      },
    },
  },
}
```

**Inworld**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "inworld",
      providers: {
        inworld: {
          apiKey: "${INWORLD_API_KEY}",
          modelId: "inworld-tts-1.5-max",
          voiceId: "Sarah",
          temperature: 0.7,
        },
      },
    },
  },
}
```

**Local CLI**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "tts-local-cli",
      providers: {
        "tts-local-cli": {
          command: "say",
          args: ["-o", "{{OutputPath}}", "{{Text}}"],
          outputFormat: "wav",
          timeoutMs: 120000,
        },
      },
    },
  },
}
```

**Microsoft**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "microsoft",
      providers: {
        microsoft: {
          enabled: true,
          voice: "en-US-MichelleNeural",
          lang: "en-US",
          outputFormat: "audio-24khz-48kbitrate-mono-mp3",
          rate: "+0%",
          pitch: "+0%",
        },
      },
    },
  },
}
```

**MiniMax**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "minimax",
      providers: {
        minimax: {
          apiKey: "${MINIMAX_API_KEY}",
          model: "speech-2.8-hd",
          voiceId: "English_expressive_narrator",
          speed: 1.0,
          vol: 1.0,
          pitch: 0,
        },
      },
    },
  },
}
```

**OpenAI + ElevenLabs**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "openai",
      summaryModel: "openai/gpt-4.1-mini",
      modelOverrides: { enabled: true },
      providers: {
        openai: {
          apiKey: "${OPENAI_API_KEY}",
          model: "gpt-4o-mini-tts",
          voice: "alloy",
        },
        elevenlabs: {
          apiKey: "${ELEVENLABS_API_KEY}",
          model: "eleven_multilingual_v2",
          voiceId: "EXAVITQu4vr4xnSDxMaL",
          voiceSettings: { stability: 0.5, similarityBoost: 0.75, style: 0.0, useSpeakerBoost: true, speed: 1.0 },
          applyTextNormalization: "auto",
          languageCode: "en",
        },
      },
    },
  },
}
```

**OpenRouter**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "openrouter",
      providers: {
        openrouter: {
          apiKey: "${OPENROUTER_API_KEY}",
          model: "hexgrad/kokoro-82m",
          voice: "af_alloy",
          responseFormat: "mp3",
        },
      },
    },
  },
}
```

**Volcengine**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "volcengine",
      providers: {
        volcengine: {
          apiKey: "${VOLCENGINE_TTS_API_KEY}",
          resourceId: "seed-tts-1.0",
          voice: "en_female_anna_mars_bigtts",
        },
      },
    },
  },
}
```

**xAI**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "xai",
      providers: {
        xai: {
          apiKey: "${XAI_API_KEY}",
          voiceId: "eve",
          language: "en",
          responseFormat: "mp3",
        },
      },
    },
  },
}
```

**Xiaomi MiMo**

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "xiaomi",
      providers: {
        xiaomi: {
          apiKey: "${XIAOMI_API_KEY}",
          model: "mimo-v2.5-tts",
          voice: "mimo_default",
          format: "mp3",
        },
      },
    },
  },
}
```

Per-agent voice overrides

Use agents.list[].tts when one agent should speak with a different provider, voice, model, persona, or auto-TTS mode. The agent block deep-merges over messages.tts, so provider credentials can stay in the global provider config:

{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
      providers: {
        elevenlabs: { apiKey: "${ELEVENLABS_API_KEY}", model: "eleven_multilingual_v2" },
      },
    },
  },
  agents: {
    list: [
      {
        id: "reader",
        tts: {
          providers: {
            elevenlabs: { voiceId: "EXAVITQu4vr4xnSDxMaL" },
          },
        },
      },
    ],
  },
}

To pin a per-agent persona, set agents.list[].tts.persona alongside provider config — it overrides the global messages.tts.persona for that agent only.

Precedence order for automatic replies, /tts audio, /tts status, and the tts agent tool, from lowest to highest (later entries override earlier ones):

  1. messages.tts
  2. active agents.list[].tts
  3. local /tts preferences for this host
  4. inline [[tts:...]] directives when model overrides are enabled
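
The layering can be pictured as a deep merge applied from the lowest layer to the highest. The sketch below is only an illustration under that assumption: `deepMerge`, `resolveTtsConfig`, and the `Layer` type are hypothetical names, not OpenClaw APIs, and the real resolver may treat individual fields differently.

```typescript
// Hypothetical helpers for illustration only; not part of OpenClaw.
type Layer = { [key: string]: unknown };

function isPlainObject(value: unknown): value is Layer {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Later layers win key by key; nested objects are merged rather than replaced.
function deepMerge(base: Layer, override: Layer): Layer {
  const out: Layer = { ...base };
  for (const [key, value] of Object.entries(override)) {
    if (value === undefined) continue; // an unset key never erases a lower layer
    const current = out[key];
    out[key] =
      isPlainObject(current) && isPlainObject(value)
        ? deepMerge(current, value)
        : value;
  }
  return out;
}

function resolveTtsConfig(layers: Layer[]): Layer {
  return layers.reduce<Layer>((acc, layer) => deepMerge(acc, layer), {});
}

const effective = resolveTtsConfig([
  // 1. messages.tts
  {
    auto: "always",
    provider: "elevenlabs",
    providers: { elevenlabs: { model: "eleven_multilingual_v2" } },
  },
  // 2. active agents.list[].tts
  { providers: { elevenlabs: { voiceId: "EXAVITQu4vr4xnSDxMaL" } } },
  // 3. local /tts preferences for this host
  { auto: "off" },
  // 4. inline directives (none in this reply)
  {},
]);

console.log(JSON.stringify(effective, null, 2));
```

In this example the agent layer adds a `voiceId` without repeating the global `model`, and the local preference switches `auto` to `off` while leaving the provider settings intact.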

Personas

A persona is a stable spoken identity that can be applied deterministically across providers. It can prefer one provider, define provider-neutral prompt intent, and carry provider-specific bindings for voices, models, prompt templates, seeds, and voice settings.

Minimal persona

{
  messages: {
    tts: {
      auto: "always",
      persona: "narrator",
      personas: {
        narrator: {
          label: "Narrator",
          provider: "elevenlabs",
          providers: {
            elevenlabs: { voiceId: "EXAVITQu4vr4xnSDxMaL", modelId: "eleven_multilingual_v2" },
          },
        },
      },
    },
  },
}

Full persona (provider-neutral prompt)

{
  messages: {
    tts: {
      auto: "always",
      persona: "alfred",
      personas: {
        alfred: {
          label: "Alfred",
          description: "Dry, warm British butler narrator.",
          provider: "google",
          fallbackPolicy: "preserve-persona",
          prompt: {
            profile: "A brilliant British butler. Dry, witty, warm, charming, emotionally expressive, never generic.",
            scene: "A quiet late-night study. Close-mic narration for a trusted operator.",
            sampleContext: "The speaker is answering a private technical request with concise confidence and dry warmth.",
            style: "Refined, understated, lightly amused.",
            accent: "British English.",
            pacing: "Measured, with short dramatic pauses.",
            constraints: ["Do not read configuration values aloud.", "Do not explain the persona."],
          },
          providers: {
            google: {
              model: "gemini-3.1-flash-tts-preview",
              voiceName: "Algieba",
              promptTemplate: "audio-profile-v1",
            },
            openai: { model: "gpt-4o-mini-tts", voice: "cedar" },
            elevenlabs: {
              voiceId: "voice_id",
              modelId: "eleven_multilingual_v2",
              seed: 42,
              voiceSettings: {
                stability: 0.65,
                similarityBoost: 0.8,
                style: 0.25,
                useSpeakerBoost: true,
                speed: 0.95,
              },
            },
          },
        },
      },
    },
  },
}

Persona resolution

The active persona is selected deterministically:

  1. /tts persona <id> local preference, if set.
  2. messages.tts.persona, if set.
  3. No persona.

Provider selection runs explicit-first:

  1. Direct overrides (CLI, gateway, Talk, allowed TTS directives).
  2. /tts provider <id> local preference.
  3. Active persona's provider.
  4. messages.tts.provider.
  5. Registry auto-select.
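
Both lists above follow a "first value that is set wins" rule. A minimal sketch of that rule, assuming each source has already been reduced to an optional id; the `ProviderInputs` shape and the function names are hypothetical and exist only for this example:

```typescript
// Hypothetical input shape for illustration; these are not OpenClaw field names.
interface ProviderInputs {
  directOverride?: string; // CLI, gateway, Talk, or an allowed TTS directive
  localPreference?: string; // /tts provider <id>
  personaProvider?: string; // the active persona's provider
  configProvider?: string; // messages.tts.provider
  autoSelectOrder: string[]; // configured providers in registry auto-select order
}

// Persona: the local preference beats messages.tts.persona; otherwise no persona.
function selectPersona(
  localPreference?: string,
  configPersona?: string,
): string | undefined {
  return localPreference ?? configPersona;
}

// Provider: walk the sources from most explicit to least explicit.
function selectProvider(inputs: ProviderInputs): string | undefined {
  const candidates: Array<string | undefined> = [
    inputs.directOverride,
    inputs.localPreference,
    inputs.personaProvider,
    inputs.configProvider,
    inputs.autoSelectOrder[0],
  ];
  return candidates.find((id) => id !== undefined && id !== "");
}

console.log(
  selectProvider({
    personaProvider: "google",
    configProvider: "openai",
    autoSelectOrder: ["microsoft"],
  }),
);
```

Here the persona's provider is chosen over `messages.tts.provider` because no direct override or local preference is present.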

For each provider attempt, OpenClaw merges configs in this order:

  1. messages.tts.providers.<id>
  2. messages.tts.personas.<persona>.providers.<id>
  3. Trusted request overrides
  4. Allowed model-emitted TTS directive overrides

How providers use persona prompts

Persona prompt fields (profile, scene, sampleContext, style, accent, pacing, constraints) are provider-neutral. Each provider decides how to use them:

  • Google Gemini: Wraps persona prompt fields in a Gemini TTS prompt structure **only when** the effective Google provider config sets `promptTemplate: "audio-profile-v1"` or `personaPrompt`. The older `audioProfile` and `speakerName` fields are still prepended as Google-specific prompt text. Inline audio tags such as `[whispers]` or `[laughs]` inside a `[[tts:text]]` block are preserved inside the Gemini transcript; OpenClaw does not generate these tags.
  • OpenAI: Maps persona prompt fields to the request `instructions` field **only when** no explicit OpenAI `instructions` is configured. Explicit `instructions` always wins.
  • Other providers: Use only the provider-specific persona bindings under `personas.<persona>.providers.<id>`. Persona prompt fields are ignored unless the provider implements its own persona-prompt mapping.

Fallback policy

fallbackPolicy controls behavior when a persona has no binding for the attempted provider:

| Policy | Behavior |
| --- | --- |
| `preserve-persona` | Default. Provider-neutral prompt fields stay available; the provider may use them or ignore them. |
| `provider-defaults` | Persona is omitted from prompt preparation for that attempt; the provider uses its neutral defaults while fallback to other providers continues. |
| `fail` | Skip that provider attempt with `reasonCode: "not_configured"` and `personaBinding: "missing"`. Fallback providers are still tried. |

The whole TTS request only fails when every attempted provider is skipped or fails.
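
The three policies can be sketched as a planning step over the provider chain. This is a simplified model under stated assumptions: the `Persona` and `AttemptPlan` shapes and `planAttempts` are hypothetical, and only the `reasonCode` and `personaBinding` values are taken from the policy descriptions above.

```typescript
type FallbackPolicy = "preserve-persona" | "provider-defaults" | "fail";

// Hypothetical shapes for illustration; not OpenClaw types.
interface Persona {
  fallbackPolicy?: FallbackPolicy;
  providers: { [providerId: string]: object };
}

interface AttemptPlan {
  provider: string;
  outcome: "attempt" | "skipped";
  usePersonaPrompt: boolean;
  reasonCode?: string;
  personaBinding?: string;
}

function planAttempts(persona: Persona, providerChain: string[]): AttemptPlan[] {
  const policy: FallbackPolicy = persona.fallbackPolicy ?? "preserve-persona";
  return providerChain.map((provider): AttemptPlan => {
    const hasBinding = Object.prototype.hasOwnProperty.call(
      persona.providers,
      provider,
    );
    if (hasBinding || policy === "preserve-persona") {
      // Prompt fields stay available; the provider may still ignore them.
      return { provider, outcome: "attempt", usePersonaPrompt: true };
    }
    if (policy === "provider-defaults") {
      // Persona is left out of prompt preparation for this attempt.
      return { provider, outcome: "attempt", usePersonaPrompt: false };
    }
    // policy === "fail": skip this provider, keep trying the rest of the chain.
    return {
      provider,
      outcome: "skipped",
      usePersonaPrompt: false,
      reasonCode: "not_configured",
      personaBinding: "missing",
    };
  });
}

const plan = planAttempts(
  { fallbackPolicy: "fail", providers: { google: {} } },
  ["google", "openai"],
);
console.log(plan);
```

With `fallbackPolicy: "fail"` and a binding only for `google`, the `openai` attempt is skipped, but the request as a whole still proceeds through `google`.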

Model-driven directives

By default, the assistant can emit [[tts:...]] directives to override voice, model, or speed for a single reply, plus an optional [[tts:text]]...[[/tts:text]] block for expressive cues that should appear in audio only:

Here you go.

[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
[[tts:text]](laughs) Read the song once more.[[/tts:text]]

When messages.tts.auto is "tagged", directives are required to trigger audio. Streaming block delivery strips directives from visible text before the channel sees them, even when split across adjacent blocks.

provider=... is ignored unless modelOverrides.allowProvider: true. When a reply declares provider=..., the other keys in that directive are parsed only by that provider; unsupported keys are stripped and reported as TTS directive warnings.

Available directive keys:

  • provider (registered provider id; requires allowProvider: true)
  • voice / voiceName / voice_name / google_voice / voiceId
  • model / google_model
  • stability, similarityBoost, style, speed, useSpeakerBoost
  • vol / volume (MiniMax volume, 0-10)
  • pitch (MiniMax integer pitch, -12 to 12; fractional values are truncated)
  • emotion (Volcengine emotion tag)
  • applyTextNormalization (auto|on|off)
  • languageCode (ISO 639-1)
  • seed
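
To make the directive syntax concrete, here is a rough parser sketch. It assumes directives are whitespace-separated `key=value` pairs and that a reply carries at most one `[[tts:text]]` block; `parseTtsDirectives` and `ParsedReply` are hypothetical names, and the sketch does not validate keys or value ranges the way a real provider would.

```typescript
// Hypothetical result shape for illustration; not an OpenClaw type.
interface ParsedReply {
  visibleText: string; // what the channel shows after directives are stripped
  overrides: { [key: string]: string }; // raw key=value pairs, not yet validated
  spokenText: string | undefined; // contents of the [[tts:text]] block, if any
}

function parseTtsDirectives(reply: string, allowProvider = false): ParsedReply {
  // Pull out the optional audio-only block first.
  const textBlock = /\[\[tts:text\]\]([\s\S]*?)\[\[\/tts:text\]\]/;
  const blockMatch = textBlock.exec(reply);
  const spokenText = blockMatch ? (blockMatch[1] ?? "").trim() : undefined;

  // Then collect key=value pairs from the remaining [[tts:...]] directives.
  const overrides: { [key: string]: string } = {};
  const visibleText = reply
    .replace(textBlock, "")
    .replace(/\[\[tts:([^\]]*)\]\]/g, (_match: string, body: string) => {
      for (const pair of body.trim().split(/\s+/)) {
        const eq = pair.indexOf("=");
        if (eq <= 0) continue; // not a key=value pair
        const key = pair.slice(0, eq);
        // provider=... is ignored unless provider switching is allowed.
        if (key === "provider" && !allowProvider) continue;
        overrides[key] = pair.slice(eq + 1);
      }
      return "";
    })
    .trim();

  return { visibleText, overrides, spokenText };
}

const parsed = parseTtsDirectives(
  [
    "Here you go.",
    "",
    "[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]",
    "[[tts:text]](laughs) Read the song once more.[[/tts:text]]",
  ].join("\n"),
);
console.log(parsed);
```

Values stay as strings here; converting `speed` to a number or truncating a MiniMax `pitch` would be a separate, provider-specific step.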

Disable model overrides entirely:

{ messages: { tts: { modelOverrides: { enabled: false } } } }

Allow provider switching while keeping other knobs configurable:

{ messages: { tts: { modelOverrides: { enabled: true, allowProvider: true, allowSeed: false } } } }

Slash commands

All TTS controls live under a single command, /tts. On Discord, OpenClaw also registers /voice because /tts is a built-in Discord command; the text form /tts ... still works.

/tts off | on | status
/tts chat on | off | default
/tts latest
/tts provider <id>
/tts persona <id> | off
/tts limit <chars>
/tts summary off
/tts audio <text>

Commands require an authorized sender (allowlist/owner rules apply) and either `commands.text` or native command registration must be enabled.

Behavior notes:

  • /tts on writes the local TTS preference to always; /tts off writes it to off.
  • /tts chat on|off|default writes a session-scoped auto-TTS override for the current chat.
  • /tts persona <id> writes the local persona preference; /tts persona off clears it.
  • /tts latest reads the latest assistant reply from the current session transcript and sends it as audio once. It stores only a hash of that reply on the session entry to suppress duplicate voice sends.
  • /tts audio generates a one-off audio reply (does not toggle TTS on).
  • limit and summary are stored in local prefs, not the main config.
  • /tts status includes fallback diagnostics for the latest attempt — Fallback: <primary> -> <used>, Attempts: ..., and per-attempt detail (provider:outcome(reasonCode) latency).
  • /status shows the active TTS mode plus configured provider, model, voice, and sanitized custom endpoint metadata when TTS is enabled.

Per-user preferences

Slash commands write local overrides to prefsPath. The default is ~/.openclaw/settings/tts.json; override with the OPENCLAW_TTS_PREFS env var or messages.tts.prefsPath.

| Stored field | Effect |
| --- | --- |
| `auto` | Local auto-TTS override (`always`, `off`, …) |
| `provider` | Local primary provider override |
| `persona` | Local persona override |
| `maxLength` | Summary threshold (default `1500` chars) |
| `summarize` | Summary toggle (default `true`) |

These override the effective config from messages.tts plus the active agents.list[].tts block for that host.

Auto-TTS behavior

When messages.tts.auto is enabled, OpenClaw:

  • Skips TTS if the reply already contains media or a MEDIA: directive.
  • Skips very short replies (under 10 chars).
  • Summarizes long replies when summaries are enabled, using summaryModel (or agents.defaults.model.primary).
  • Attaches the generated audio to the reply.
  • In mode: "final", still sends audio-only TTS for streamed final replies after the text stream completes; the generated media goes through the same channel media normalization as normal reply attachments.

If the reply exceeds maxLength and summary is off (or no API key for the summary model), audio is skipped and the normal text reply is sent.

Reply -> TTS enabled?
  no  -> send text
  yes -> has media / MEDIA: / short?
          yes -> send text
          no  -> length > limit?
                   no  -> TTS -> attach audio
                   yes -> summary enabled?
                            no  -> send text
                            yes -> summarize -> TTS -> attach audio
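
The same decision flow, written as a small function. This is a sketch of the diagram only: the `AutoTtsInput` shape and the decision labels are hypothetical, and the 10-character minimum and the length check come from the bullets above.

```typescript
// Hypothetical input shape for illustration; not an OpenClaw type.
interface AutoTtsInput {
  ttsEnabled: boolean;
  text: string;
  hasMedia: boolean; // reply already contains media or a MEDIA: directive
  maxLength: number; // summary threshold
  summaryEnabled: boolean;
  summaryModelAvailable: boolean; // false when the summary model has no API key
}

type AutoTtsDecision = "send-text" | "tts" | "summarize-then-tts";

const MIN_CHARS = 10; // replies under 10 chars are skipped

function decideAutoTts(input: AutoTtsInput): AutoTtsDecision {
  if (!input.ttsEnabled) return "send-text";
  if (input.hasMedia || input.text.length < MIN_CHARS) return "send-text";
  if (input.text.length <= input.maxLength) return "tts";
  if (!input.summaryEnabled || !input.summaryModelAvailable) return "send-text";
  return "summarize-then-tts";
}

console.log(
  decideAutoTts({
    ttsEnabled: true,
    text: "x".repeat(2000),
    hasMedia: false,
    maxLength: 1500,
    summaryEnabled: true,
    summaryModelAvailable: true,
  }),
);
```

A 2000-character reply with summaries enabled takes the summarize branch; the same reply with summaries off falls back to plain text.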

Output formats by channel

| Target | Format |
| --- | --- |
| Feishu / Matrix / Telegram / WhatsApp | Voice-note replies prefer Opus (`opus_48000_64` from ElevenLabs, `opus` from OpenAI). 48 kHz / 64 kbps balances clarity and size. |
| Other channels | MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI). 44.1 kHz / 128 kbps default for speech. |
| Talk / telephony | Provider-native PCM (Inworld 22050 Hz, Google 24 kHz), or `ulaw_8000` from Gradium for telephony. |

Per-provider notes:

  • Feishu / WhatsApp transcoding: When a voice-note reply lands as MP3/WebM/WAV/M4A, the channel plugin transcodes to 48 kHz Ogg/Opus with ffmpeg. WhatsApp sends through Baileys with ptt: true and audio/ogg; codecs=opus. If conversion fails: Feishu falls back to attaching the original file; WhatsApp send fails rather than posting an incompatible PTT payload.
  • MiniMax / Xiaomi MiMo: Default MP3 (32 kHz for MiniMax speech-2.8-hd); transcoded to 48 kHz Opus for voice-note targets via ffmpeg.
  • Local CLI: Uses configured outputFormat. Voice-note targets are converted to Ogg/Opus and telephony output to raw 16 kHz mono PCM.
  • Google Gemini: Returns raw 24 kHz PCM. OpenClaw wraps as WAV for attachments, transcodes to 48 kHz Opus for voice-note targets, returns PCM directly for Talk/telephony.
  • Inworld: MP3 attachments, native OGG_OPUS voice-note, raw PCM 22050 Hz for Talk/telephony.
  • xAI: MP3 by default; responseFormat may be mp3|wav|pcm|mulaw|alaw. Uses xAI's batch REST endpoint — streaming WebSocket TTS is not used. Native Opus voice-note format is not supported.
  • Microsoft: Uses microsoft.outputFormat (default audio-24khz-48kbitrate-mono-mp3). Telegram sendVoice accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages. If the configured Microsoft format fails, OpenClaw retries with MP3.

OpenAI and ElevenLabs output formats are fixed per channel as listed above.
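
For those two providers the mapping reduces to a lookup on the target channel. A minimal sketch, assuming lowercase channel ids such as `"telegram"` (the id spelling and `pickOutputFormat` itself are hypothetical; the format strings come from the table above):

```typescript
type FixedFormatProvider = "elevenlabs" | "openai";

// Channels that receive native voice messages; lowercase ids are an assumption.
const VOICE_NOTE_CHANNELS = new Set(["feishu", "matrix", "telegram", "whatsapp"]);

function pickOutputFormat(provider: FixedFormatProvider, channel: string): string {
  const voiceNote = VOICE_NOTE_CHANNELS.has(channel.toLowerCase());
  if (provider === "elevenlabs") {
    return voiceNote ? "opus_48000_64" : "mp3_44100_128";
  }
  return voiceNote ? "opus" : "mp3";
}

console.log(pickOutputFormat("elevenlabs", "telegram"));
console.log(pickOutputFormat("openai", "discord"));
```

Talk and telephony targets are left out because they use provider-native PCM or u-law streams rather than these attachment formats.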

Field reference

Top-level fields (under messages.tts):

  • `auto`: Auto-TTS mode. `inbound` only sends audio after an inbound voice message; `tagged` only sends audio when the reply includes `[[tts:...]]` directives or a `[[tts:text]]` block.
  • Legacy toggle: `openclaw doctor --fix` migrates this to `auto`.
  • `mode`: `"all"` includes tool/block replies in addition to final replies.
  • `provider`: Speech provider id. When unset, OpenClaw uses the first configured provider in registry auto-select order. Legacy `provider: "edge"` is rewritten to `"microsoft"` by `openclaw doctor --fix`.
  • `persona`: Active persona id from `personas`. Normalized to lowercase.
  • `personas.<id>`: Stable spoken identity. Fields: `label`, `description`, `provider`, `fallbackPolicy`, `prompt`, `providers.<id>`. See [Personas](#personas).
  • `summaryModel`: Cheap model for auto-summary; defaults to `agents.defaults.model.primary`. Accepts `provider/model` or a configured model alias.
  • `modelOverrides`: Allow the model to emit TTS directives. `enabled` defaults to `true`; `allowProvider` defaults to `false`.
  • `providers.<id>`: Provider-owned settings keyed by speech provider id. Legacy direct blocks (`messages.tts.openai`, `.elevenlabs`, `.microsoft`, `.edge`) are rewritten by `openclaw doctor --fix`; commit only `messages.tts.providers.<id>`.
  • Input character cap: Hard cap for TTS input characters. `/tts audio` fails if exceeded.
  • Request timeout: Request timeout in milliseconds.
  • `prefsPath`: Override the local prefs JSON path (provider/limit/summary). Default `~/.openclaw/settings/tts.json`.

Azure Speech (`azure-speech`):

  • `apiKey`: Env: `AZURE_SPEECH_KEY`, `AZURE_SPEECH_API_KEY`, or `SPEECH_KEY`.
  • `region`: Azure Speech region (e.g. `eastus`). Env: `AZURE_SPEECH_REGION` or `SPEECH_REGION`.
  • Endpoint: Optional Azure Speech endpoint override (alias `baseUrl`).
  • `voice`: Azure voice ShortName. Default `en-US-JennyNeural`.
  • `lang`: SSML language code. Default `en-US`.
  • `outputFormat`: Azure `X-Microsoft-OutputFormat` for standard audio. Default `audio-24khz-48kbitrate-mono-mp3`.
  • `voiceNoteOutputFormat`: Azure `X-Microsoft-OutputFormat` for voice-note output. Default `ogg-24khz-16bit-mono-opus`.

ElevenLabs (`elevenlabs`):

  • `apiKey`: Falls back to `ELEVENLABS_API_KEY` or `XI_API_KEY`.
  • `model`: Model id (e.g. `eleven_multilingual_v2`, `eleven_v3`).
  • `voiceId`: ElevenLabs voice id.
  • `voiceSettings`: `stability`, `similarityBoost`, `style` (each `0..1`), `useSpeakerBoost` (`true|false`), `speed` (`0.5..2.0`, `1.0` = normal).
  • `applyTextNormalization`: Text normalization mode.
  • `languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`).
  • `seed`: Integer `0..4294967295` for best-effort determinism.
  • Base URL: Override ElevenLabs API base URL.

Google Gemini (`google`):

  • `apiKey`: Falls back to `GEMINI_API_KEY` / `GOOGLE_API_KEY`. If omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
  • `model`: Gemini TTS model. Default `gemini-3.1-flash-tts-preview`.
  • `voiceName`: Gemini prebuilt voice name. Default `Kore`. Alias: `voice`.
  • `audioProfile`: Natural-language style prompt prepended before spoken text.
  • `speakerName`: Optional speaker label prepended before spoken text when your prompt uses a named speaker.
  • `promptTemplate`: Set to `audio-profile-v1` to wrap active persona prompt fields in a deterministic Gemini TTS prompt structure.
  • `personaPrompt`: Google-specific extra persona prompt text appended to the template's Director's Notes.
  • Base URL: Only `https://generativelanguage.googleapis.com` is accepted.

Gradium (`gradium`):

  • `apiKey`: Env: `GRADIUM_API_KEY`.
  • Base URL: Default `https://api.gradium.ai`.
  • `voiceId`: Default Emma (`YTpq7expH9539ERJ`).

Inworld (`inworld`):

  • `apiKey`: Env: `INWORLD_API_KEY`.
  • Base URL: Default `https://api.inworld.ai`.
  • `modelId`: Default `inworld-tts-1.5-max`. Also: `inworld-tts-1.5-mini`, `inworld-tts-1-max`, `inworld-tts-1`.
  • `voiceId`: Default `Sarah`.
  • `temperature`: Sampling temperature `0..2`.

Local CLI (`tts-local-cli`):

  • `command`: Local executable or command string for CLI TTS.
  • `args`: Command arguments. Supports `{{Text}}`, `{{OutputPath}}`, `{{OutputDir}}`, `{{OutputBase}}` placeholders.
  • `outputFormat`: Expected CLI output format. Default `mp3` for audio attachments.
  • `timeoutMs`: Command timeout in milliseconds. Default `120000`.
  • Working directory: Optional command working directory.
  • Environment: Optional environment overrides for the command.

Microsoft (`microsoft`):

  • `enabled`: Allow Microsoft speech usage.
  • `voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
  • `lang`: Language code (e.g. `en-US`).
  • `outputFormat`: Microsoft output format. Default `audio-24khz-48kbitrate-mono-mp3`. Not all formats are supported by the bundled Edge-backed transport.
  • `rate`, `pitch`: Percent strings (e.g. `+10%`, `-5%`).
  • Subtitles: Write JSON subtitles alongside the audio file.
  • Proxy: Proxy URL for Microsoft speech requests.
  • Timeout: Request timeout override (ms).
  • Legacy alias: Run `openclaw doctor --fix` to rewrite persisted config to `providers.microsoft`.

MiniMax (`minimax`):

  • `apiKey`: Falls back to `MINIMAX_API_KEY`. Token Plan auth via `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, or `MINIMAX_CODING_API_KEY`.
  • Base URL: Default `https://api.minimax.io`. Env: `MINIMAX_API_HOST`.
  • `model`: Default `speech-2.8-hd`. Env: `MINIMAX_TTS_MODEL`.
  • `voiceId`: Default `English_expressive_narrator`. Env: `MINIMAX_TTS_VOICE_ID`.
  • `speed`: `0.5..2.0`. Default `1.0`.
  • `vol`: `(0, 10]`. Default `1.0`.
  • `pitch`: Integer `-12..12`. Default `0`. Fractional values are truncated before the request.

OpenAI (`openai`):

  • `apiKey`: Falls back to `OPENAI_API_KEY`.
  • `model`: OpenAI TTS model id (e.g. `gpt-4o-mini-tts`).
  • `voice`: Voice name (e.g. `alloy`, `cedar`).
  • `instructions`: Explicit OpenAI `instructions` field. When set, persona prompt fields are **not** auto-mapped.
  • Base URL: Override the OpenAI TTS endpoint. Resolution order: config → `OPENAI_TTS_BASE_URL` → `https://api.openai.com/v1`. Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted.

OpenRouter (`openrouter`):

  • `apiKey`: Env: `OPENROUTER_API_KEY`. Can reuse `models.providers.openrouter.apiKey`.
  • Base URL: Default `https://openrouter.ai/api/v1`. Legacy `https://openrouter.ai/v1` is normalized.
  • `model`: Default `hexgrad/kokoro-82m`. Alias: `modelId`.
  • `voice`: Default `af_alloy`. Alias: `voiceId`.
  • `responseFormat`: Default `mp3`.
  • Speed: Provider-native speed override.

Volcengine (`volcengine`):

  • `apiKey`: Env: `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY`.
  • `resourceId`: Default `seed-tts-1.0`. Env: `VOLCENGINE_TTS_RESOURCE_ID`. Use `seed-tts-2.0` when your project has TTS 2.0 entitlement.
  • App key: App key header. Default `aGjiRDfUWi`. Env: `VOLCENGINE_TTS_APP_KEY`.
  • Base URL: Override the Seed Speech TTS HTTP endpoint. Env: `VOLCENGINE_TTS_BASE_URL`.
  • `voice`: Voice type. Default `en_female_anna_mars_bigtts`. Env: `VOLCENGINE_TTS_VOICE`.
  • Speed ratio: Provider-native speed ratio.
  • `emotion`: Provider-native emotion tag.
  • Legacy AppID / token / cluster: Legacy Volcengine Speech Console fields. Env: `VOLCENGINE_TTS_APPID`, `VOLCENGINE_TTS_TOKEN`, `VOLCENGINE_TTS_CLUSTER` (default `volcano_tts`).

xAI (`xai`):

  • `apiKey`: Env: `XAI_API_KEY`.
  • Base URL: Default `https://api.x.ai/v1`. Env: `XAI_BASE_URL`.
  • `voiceId`: Default `eve`. Live voices: `ara`, `eve`, `leo`, `rex`, `sal`, `una`.
  • `language`: BCP-47 language code or `auto`. Default `en`.
  • `responseFormat`: Default `mp3`.
  • Speed: Provider-native speed override.

Xiaomi MiMo (`xiaomi`):

  • `apiKey`: Env: `XIAOMI_API_KEY`.
  • Base URL: Default `https://api.xiaomimimo.com/v1`. Env: `XIAOMI_BASE_URL`.
  • `model`: Default `mimo-v2.5-tts`. Env: `XIAOMI_TTS_MODEL`. Also supports `mimo-v2-tts`.
  • `voice`: Default `mimo_default`. Env: `XIAOMI_TTS_VOICE`.
  • `format`: Default `mp3`. Env: `XIAOMI_TTS_FORMAT`.
  • Style instruction: Optional natural-language style instruction sent as the user message; not spoken.

Agent tool

The tts tool converts text to speech and returns an audio attachment for reply delivery. On Feishu, Matrix, Telegram, and WhatsApp, the audio is delivered as a voice message rather than a file attachment. Feishu and WhatsApp can transcode non-Opus TTS output on this path when ffmpeg is available.

WhatsApp sends audio through Baileys as a PTT voice note (audio with ptt: true) and sends visible text separately from PTT audio because clients do not consistently render captions on voice notes.

The tool accepts optional channel and timeoutMs fields; timeoutMs is a per-call provider request timeout in milliseconds.

Gateway RPC

| Method | Purpose |
| --- | --- |
| `tts.status` | Read current TTS state and last attempt. |
| `tts.enable` | Set local auto preference to `always`. |
| `tts.disable` | Set local auto preference to `off`. |
| `tts.convert` | One-off text → audio. |
| `tts.setProvider` | Set local provider preference. |
| `tts.setPersona` | Set local persona preference. |
| `tts.providers` | List configured providers and status. |