---
summary: Text-to-speech (TTS) for outbound replies
read_when:
  - Enabling text-to-speech for replies
  - Configuring TTS providers or limits
  - Using /tts commands
title: Text-to-speech
---

OpenClaw can convert outbound replies into audio using Azure Speech, ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, OpenRouter, Volcengine, Vydra, xAI, or Xiaomi MiMo. It works anywhere OpenClaw can send audio.

Supported services

  • Azure Speech (primary or fallback provider; uses the Azure AI Speech REST API)
  • ElevenLabs (primary or fallback provider)
  • Google Gemini (primary or fallback provider; uses Gemini API TTS)
  • Gradium (primary or fallback provider; supports voice-note and telephony output)
  • Inworld (primary or fallback provider; uses the Inworld streaming TTS API)
  • Local CLI (primary or fallback provider; runs a configured local TTS command)
  • Microsoft (primary or fallback provider; current bundled implementation uses node-edge-tts)
  • MiniMax (primary or fallback provider; uses the T2A v2 API)
  • OpenAI (primary or fallback provider; also used for summaries)
  • OpenRouter (primary or fallback provider; uses the same API key path as the bundled OpenRouter model provider)
  • Volcengine (primary or fallback provider; uses the BytePlus Seed Speech HTTP API)
  • Vydra (primary or fallback provider; shared image, video, and speech provider)
  • xAI (primary or fallback provider; uses the xAI TTS API)
  • Xiaomi MiMo (primary or fallback provider; uses MiMo TTS through Xiaomi chat completions)

Microsoft speech notes

The bundled Microsoft speech provider currently uses Microsoft Edge's online neural TTS service via the node-edge-tts library. It's a hosted service (not local), uses Microsoft endpoints, and does not require an API key. node-edge-tts exposes speech configuration options and output formats, but not all of them are supported by the service. Legacy config and directives that reference edge still work and are normalized to microsoft.

Because this path is a public web service without a published SLA or quota, treat it as best-effort. If you need guaranteed limits and support, use OpenAI or ElevenLabs.

Optional keys

If you want Azure Speech, ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, OpenRouter, Volcengine, Vydra, xAI, or Xiaomi MiMo:

  • AZURE_SPEECH_KEY plus AZURE_SPEECH_REGION (also accepts AZURE_SPEECH_API_KEY, SPEECH_KEY, and SPEECH_REGION)
  • ELEVENLABS_API_KEY (or XI_API_KEY)
  • GEMINI_API_KEY (or GOOGLE_API_KEY)
  • GRADIUM_API_KEY
  • INWORLD_API_KEY
  • MINIMAX_API_KEY; MiniMax TTS also accepts Token Plan auth via MINIMAX_OAUTH_TOKEN, MINIMAX_CODE_PLAN_KEY, or MINIMAX_CODING_API_KEY
  • OPENAI_API_KEY
  • OPENROUTER_API_KEY (can reuse models.providers.openrouter.apiKey)
  • VOLCENGINE_TTS_API_KEY (or BYTEPLUS_SEED_SPEECH_API_KEY); legacy AppID/token auth also accepts VOLCENGINE_TTS_APPID and VOLCENGINE_TTS_TOKEN
  • VYDRA_API_KEY
  • XAI_API_KEY
  • XIAOMI_API_KEY

Local CLI and Microsoft speech do not require an API key.

If multiple providers are configured, the selected provider is used first and the others are fallback options. Auto-summary uses the configured summaryModel (or agents.defaults.model.primary), so that provider must also be authenticated if you enable summaries.

Is it enabled by default?

No. AutoTTS is off by default. Enable it in config with messages.tts.auto or locally with /tts on.

When messages.tts.provider is unset, OpenClaw picks the first configured speech provider in registry auto-select order.

Config

TTS config lives under messages.tts in openclaw.json. Full schema is in Gateway configuration.

Minimal config (enable + provider)

{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
    },
  },
}

OpenAI primary with ElevenLabs fallback

{
  messages: {
    tts: {
      auto: "always",
      provider: "openai",
      summaryModel: "openai/gpt-4.1-mini",
      modelOverrides: {
        enabled: true,
      },
      providers: {
        openai: {
          apiKey: "openai_api_key",
          baseUrl: "https://api.openai.com/v1",
          model: "gpt-4o-mini-tts",
          voice: "alloy",
        },
        elevenlabs: {
          apiKey: "elevenlabs_api_key",
          baseUrl: "https://api.elevenlabs.io",
          voiceId: "voice_id",
          modelId: "eleven_multilingual_v2",
          seed: 42,
          applyTextNormalization: "auto",
          languageCode: "en",
          voiceSettings: {
            stability: 0.5,
            similarityBoost: 0.75,
            style: 0.0,
            useSpeakerBoost: true,
            speed: 1.0,
          },
        },
      },
    },
  },
}

Azure Speech primary

{
  messages: {
    tts: {
      auto: "always",
      provider: "azure-speech",
      providers: {
        "azure-speech": {
          // apiKey falls back to AZURE_SPEECH_KEY.
          // region falls back to AZURE_SPEECH_REGION.
          voice: "en-US-JennyNeural",
          lang: "en-US",
          outputFormat: "audio-24khz-48kbitrate-mono-mp3",
          voiceNoteOutputFormat: "ogg-24khz-16bit-mono-opus",
        },
      },
    },
  },
}

Azure Speech uses a Speech resource key, not an Azure OpenAI key. Resolution order is messages.tts.providers.azure-speech.apiKey -> AZURE_SPEECH_KEY -> AZURE_SPEECH_API_KEY -> SPEECH_KEY, plus messages.tts.providers.azure-speech.region -> AZURE_SPEECH_REGION -> SPEECH_REGION for the region. New config should use azure-speech; azure is accepted as a provider alias.

Microsoft primary (no API key)

{
  messages: {
    tts: {
      auto: "always",
      provider: "microsoft",
      providers: {
        microsoft: {
          enabled: true,
          voice: "en-US-MichelleNeural",
          lang: "en-US",
          outputFormat: "audio-24khz-48kbitrate-mono-mp3",
          rate: "+10%",
          pitch: "-5%",
        },
      },
    },
  },
}

MiniMax primary

{
  messages: {
    tts: {
      auto: "always",
      provider: "minimax",
      providers: {
        minimax: {
          apiKey: "minimax_api_key",
          baseUrl: "https://api.minimax.io",
          model: "speech-2.8-hd",
          voiceId: "English_expressive_narrator",
          speed: 1.0,
          vol: 1.0,
          pitch: 0,
        },
      },
    },
  },
}

MiniMax TTS auth resolution is messages.tts.providers.minimax.apiKey, then stored minimax-portal OAuth/token profiles, then Token Plan environment keys (MINIMAX_OAUTH_TOKEN, MINIMAX_CODE_PLAN_KEY, MINIMAX_CODING_API_KEY), then MINIMAX_API_KEY. When no explicit TTS baseUrl is set, OpenClaw can reuse the configured minimax-portal OAuth host for Token Plan speech.

Google Gemini primary

{
  messages: {
    tts: {
      auto: "always",
      provider: "google",
      providers: {
        google: {
          apiKey: "gemini_api_key",
          model: "gemini-3.1-flash-tts-preview",
          voiceName: "Kore",
        },
      },
    },
  },
}

Google Gemini TTS uses the Gemini API key path. A Google Cloud Console API key restricted to the Gemini API is valid here, and it is the same style of key used by the bundled Google image-generation provider. Resolution order is messages.tts.providers.google.apiKey -> models.providers.google.apiKey -> GEMINI_API_KEY -> GOOGLE_API_KEY.

Inworld primary

{
  messages: {
    tts: {
      auto: "always",
      provider: "inworld",
      providers: {
        inworld: {
          apiKey: "inworld_api_key",
          baseUrl: "https://api.inworld.ai",
          voiceId: "Sarah",
          modelId: "inworld-tts-1.5-max",
          temperature: 0.8,
        },
      },
    },
  },
}

The apiKey value must be the Base64-encoded credential string copied verbatim from the Inworld dashboard (Workspace > API Keys). The provider sends it as Authorization: Basic <apiKey> without any additional encoding, so do not pass a raw bearer token and do not Base64-encode it yourself. The key falls back to the INWORLD_API_KEY env var. See Inworld provider for full setup.

Volcengine primary

{
  messages: {
    tts: {
      auto: "always",
      provider: "volcengine",
      providers: {
        volcengine: {
          apiKey: "byteplus_seed_speech_api_key",
          resourceId: "seed-tts-1.0",
          voice: "en_female_anna_mars_bigtts",
          speedRatio: 1.0,
        },
      },
    },
  },
}

Volcengine TTS uses the BytePlus Seed Speech API key from the Speech Console, not the OpenAI-compatible VOLCANO_ENGINE_API_KEY used for Doubao model providers. Resolution order is messages.tts.providers.volcengine.apiKey -> VOLCENGINE_TTS_API_KEY -> BYTEPLUS_SEED_SPEECH_API_KEY. Legacy AppID/token auth still works through messages.tts.providers.volcengine.appId / token or VOLCENGINE_TTS_APPID / VOLCENGINE_TTS_TOKEN. Voice-note targets request provider-native ogg_opus; normal audio-file targets request mp3.

xAI primary

{
  messages: {
    tts: {
      auto: "always",
      provider: "xai",
      providers: {
        xai: {
          apiKey: "xai_api_key",
          voiceId: "eve",
          language: "en",
          responseFormat: "mp3",
          speed: 1.0,
        },
      },
    },
  },
}

xAI TTS uses the same XAI_API_KEY path as the bundled Grok model provider. Resolution order is messages.tts.providers.xai.apiKey -> XAI_API_KEY. Current live voices are ara, eve, leo, rex, sal, and una; eve is the default. language accepts a BCP-47 tag or auto.

Xiaomi MiMo primary

{
  messages: {
    tts: {
      auto: "always",
      provider: "xiaomi",
      providers: {
        xiaomi: {
          apiKey: "xiaomi_api_key",
          baseUrl: "https://api.xiaomimimo.com/v1",
          model: "mimo-v2.5-tts",
          voice: "mimo_default",
          format: "mp3",
          style: "Bright, natural, conversational tone.",
        },
      },
    },
  },
}

Xiaomi MiMo TTS uses the same XIAOMI_API_KEY path as the bundled Xiaomi model provider. The speech provider id is xiaomi; mimo is accepted as an alias. The target text is sent as the assistant message, matching Xiaomi's TTS contract. Optional style is sent as a user instruction and is not spoken.

OpenRouter primary

{
  messages: {
    tts: {
      auto: "always",
      provider: "openrouter",
      providers: {
        openrouter: {
          apiKey: "openrouter_api_key",
          model: "hexgrad/kokoro-82m",
          voice: "af_alloy",
          responseFormat: "mp3",
        },
      },
    },
  },
}

OpenRouter TTS uses the same OPENROUTER_API_KEY path as the bundled OpenRouter model provider. Resolution order is messages.tts.providers.openrouter.apiKey -> models.providers.openrouter.apiKey -> OPENROUTER_API_KEY.

Local CLI primary

{
  messages: {
    tts: {
      auto: "always",
      provider: "tts-local-cli",
      providers: {
        "tts-local-cli": {
          command: "say",
          args: ["-o", "{{OutputPath}}", "{{Text}}"],
          outputFormat: "wav",
          timeoutMs: 120000,
        },
      },
    },
  },
}

Local CLI TTS runs the configured command on the gateway host. {{Text}}, {{OutputPath}}, {{OutputDir}}, and {{OutputBase}} placeholders are expanded in args; if no {{Text}} placeholder is present, OpenClaw writes the spoken text to stdin. outputFormat accepts mp3, opus, or wav. Voice-note targets are transcoded to Ogg/Opus and telephony output is transcoded to raw 16 kHz mono PCM with ffmpeg. The legacy provider alias cli still works, but new config should use tts-local-cli.
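
As noted above, when args contains no {{Text}} placeholder, OpenClaw pipes the spoken text to the command's stdin. A sketch of that variant (piper is only an illustrative command name, not a bundled default):

{
  messages: {
    tts: {
      auto: "always",
      provider: "tts-local-cli",
      providers: {
        "tts-local-cli": {
          // No {{Text}} placeholder: OpenClaw writes the text to stdin.
          command: "piper",
          args: ["--output_file", "{{OutputPath}}"],
          outputFormat: "wav",
        },
      },
    },
  },
}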

Gradium primary

{
  messages: {
    tts: {
      auto: "always",
      provider: "gradium",
      providers: {
        gradium: {
          apiKey: "gradium_api_key",
          baseUrl: "https://api.gradium.ai",
          voiceId: "YTpq7expH9539ERJ",
        },
      },
    },
  },
}

Disable Microsoft speech

{
  messages: {
    tts: {
      providers: {
        microsoft: {
          enabled: false,
        },
      },
    },
  },
}

Custom limits + prefs path

{
  messages: {
    tts: {
      auto: "always",
      maxTextLength: 4000,
      timeoutMs: 30000,
      prefsPath: "~/.openclaw/settings/tts.json",
    },
  },
}

Only reply with audio after an inbound voice message

{
  messages: {
    tts: {
      auto: "inbound",
    },
  },
}
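
Only speak when the reply carries TTS directives (tagged)

{
  messages: {
    tts: {
      auto: "tagged",
    },
  },
}

With tagged, audio is only produced when the reply includes [[tts:key=value]] directives or a [[tts:text]]...[[/tts:text]] block.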

Disable auto-summary for long replies

{
  messages: {
    tts: {
      auto: "always",
    },
  },
}

Then run:

/tts summary off

Notes on fields

  • auto: autoTTS mode (off, always, inbound, tagged).
    • inbound only sends audio after an inbound voice message.
    • tagged only sends audio when the reply includes [[tts:key=value]] directives or a [[tts:text]]...[[/tts:text]] block.
  • enabled: legacy toggle (doctor migrates this to auto).
  • mode: "final" (default) or "all" (includes tool/block replies).
  • provider: speech provider id such as "azure-speech", "elevenlabs", "google", "gradium", "inworld", "microsoft", "minimax", "openai", "openrouter", "tts-local-cli", "volcengine", "vydra", "xai", or "xiaomi" (fallback is automatic).
  • If provider is unset, OpenClaw uses the first configured speech provider in registry auto-select order.
  • Legacy provider: "edge" config is repaired by openclaw doctor --fix and rewritten to provider: "microsoft".
  • summaryModel: optional cheap model for auto-summary; defaults to agents.defaults.model.primary.
    • Accepts provider/model or a configured model alias.
  • modelOverrides: allow the model to emit TTS directives (on by default).
    • allowProvider defaults to false (provider switching is opt-in).
  • providers.<id>: provider-owned settings keyed by speech provider id.
  • Legacy direct provider blocks (messages.tts.openai, messages.tts.elevenlabs, messages.tts.microsoft, messages.tts.edge) are repaired by openclaw doctor --fix; committed config should use messages.tts.providers.<id>.
  • Legacy messages.tts.providers.edge is also repaired by openclaw doctor --fix; committed config should use messages.tts.providers.microsoft.
  • maxTextLength: hard cap for TTS input (chars). /tts audio fails if exceeded.
  • timeoutMs: request timeout (ms).
  • prefsPath: override the local prefs JSON path (provider/limit/summary).
  • apiKey values fall back to env vars (AZURE_SPEECH_KEY/AZURE_SPEECH_API_KEY/SPEECH_KEY, ELEVENLABS_API_KEY/XI_API_KEY, GEMINI_API_KEY/GOOGLE_API_KEY, GRADIUM_API_KEY, INWORLD_API_KEY, MINIMAX_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY, VOLCENGINE_TTS_API_KEY/BYTEPLUS_SEED_SPEECH_API_KEY, VYDRA_API_KEY, XAI_API_KEY, XIAOMI_API_KEY). Legacy Volcengine auth uses appId/token instead.
  • providers.azure-speech.apiKey: Azure Speech resource key (env: AZURE_SPEECH_KEY, AZURE_SPEECH_API_KEY, or SPEECH_KEY).
  • providers.azure-speech.region: Azure Speech region such as eastus (env: AZURE_SPEECH_REGION or SPEECH_REGION).
  • providers.azure-speech.endpoint / providers.azure-speech.baseUrl: optional Azure Speech endpoint/base URL override.
  • providers.azure-speech.voice: Azure voice ShortName (default en-US-JennyNeural).
  • providers.azure-speech.lang: SSML language code (default en-US).
  • providers.azure-speech.outputFormat: Azure X-Microsoft-OutputFormat for standard audio output (default audio-24khz-48kbitrate-mono-mp3).
  • providers.azure-speech.voiceNoteOutputFormat: Azure X-Microsoft-OutputFormat for voice-note output (default ogg-24khz-16bit-mono-opus).
  • providers.elevenlabs.baseUrl: override ElevenLabs API base URL.
  • providers.openai.baseUrl: override the OpenAI TTS endpoint.
    • Resolution order: messages.tts.providers.openai.baseUrl -> OPENAI_TTS_BASE_URL -> https://api.openai.com/v1
    • Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted.
  • providers.elevenlabs.voiceSettings:
    • stability, similarityBoost, style: 0..1
    • useSpeakerBoost: true|false
    • speed: 0.5..2.0 (1.0 = normal)
  • providers.elevenlabs.applyTextNormalization: auto|on|off
  • providers.elevenlabs.languageCode: 2-letter ISO 639-1 (e.g. en, de)
  • providers.elevenlabs.seed: integer 0..4294967295 (best-effort determinism)
  • providers.minimax.baseUrl: override MiniMax API base URL (default https://api.minimax.io, env: MINIMAX_API_HOST).
  • providers.minimax.model: TTS model (default speech-2.8-hd, env: MINIMAX_TTS_MODEL).
  • providers.minimax.voiceId: voice identifier (default English_expressive_narrator, env: MINIMAX_TTS_VOICE_ID).
  • providers.minimax.speed: playback speed 0.5..2.0 (default 1.0).
  • providers.minimax.vol: volume in the range (0, 10] (default 1.0; must be greater than 0).
  • providers.minimax.pitch: integer pitch shift -12..12 (default 0). Fractional values are truncated before calling MiniMax T2A because the API rejects non-integer pitch values.
  • providers.tts-local-cli.command: local executable or command string for CLI TTS.
  • providers.tts-local-cli.args: command arguments; supports {{Text}}, {{OutputPath}}, {{OutputDir}}, and {{OutputBase}} placeholders.
  • providers.tts-local-cli.outputFormat: expected CLI output format (mp3, opus, or wav; default mp3 for audio attachments).
  • providers.tts-local-cli.timeoutMs: command timeout in milliseconds (default 120000).
  • providers.tts-local-cli.cwd: optional command working directory.
  • providers.tts-local-cli.env: optional string environment overrides for the command.
  • providers.inworld.baseUrl: override Inworld API base URL (default https://api.inworld.ai).
  • providers.inworld.voiceId: Inworld voice identifier (default Sarah).
  • providers.inworld.modelId: Inworld TTS model (default inworld-tts-1.5-max; also supports inworld-tts-1.5-mini, inworld-tts-1-max, inworld-tts-1).
  • providers.inworld.temperature: sampling temperature 0..2 (optional).
  • providers.google.model: Gemini TTS model (default gemini-3.1-flash-tts-preview).
  • providers.google.voiceName: Gemini prebuilt voice name (default Kore; voice is also accepted).
  • providers.google.audioProfile: natural-language style prompt prepended before the spoken text.
  • providers.google.speakerName: optional speaker label prepended before the spoken text when your TTS prompt uses a named speaker.
  • providers.google.baseUrl: override the Gemini API base URL. Only https://generativelanguage.googleapis.com is accepted.
    • If messages.tts.providers.google.apiKey is omitted, TTS can reuse models.providers.google.apiKey before env fallback.
  • providers.gradium.baseUrl: override Gradium API base URL (default https://api.gradium.ai).
  • providers.gradium.voiceId: Gradium voice identifier (default Emma, voice id YTpq7expH9539ERJ).
  • providers.volcengine.apiKey: BytePlus Seed Speech API key (env: VOLCENGINE_TTS_API_KEY or BYTEPLUS_SEED_SPEECH_API_KEY).
  • providers.volcengine.resourceId: BytePlus Seed Speech resource id (default seed-tts-1.0, env: VOLCENGINE_TTS_RESOURCE_ID; use seed-tts-2.0 when your BytePlus project has TTS 2.0 entitlement).
  • providers.volcengine.appKey: BytePlus Seed Speech app key header (default aGjiRDfUWi, env: VOLCENGINE_TTS_APP_KEY).
  • providers.volcengine.baseUrl: override the Seed Speech TTS HTTP endpoint (env: VOLCENGINE_TTS_BASE_URL).
  • providers.volcengine.appId: legacy Volcengine Speech Console application id (env: VOLCENGINE_TTS_APPID).
  • providers.volcengine.token: legacy Volcengine Speech Console access token (env: VOLCENGINE_TTS_TOKEN).
  • providers.volcengine.cluster: legacy Volcengine TTS cluster (default volcano_tts, env: VOLCENGINE_TTS_CLUSTER).
  • providers.volcengine.voice: voice type (default en_female_anna_mars_bigtts, env: VOLCENGINE_TTS_VOICE).
  • providers.volcengine.speedRatio: provider-native speed ratio.
  • providers.volcengine.emotion: provider-native emotion tag.
  • providers.xai.apiKey: xAI TTS API key (env: XAI_API_KEY).
  • providers.xai.baseUrl: override the xAI TTS base URL (default https://api.x.ai/v1, env: XAI_BASE_URL).
  • providers.xai.voiceId: xAI voice id (default eve; current live voices: ara, eve, leo, rex, sal, una).
  • providers.xai.language: BCP-47 language code or auto (default en).
  • providers.xai.responseFormat: mp3, wav, pcm, mulaw, or alaw (default mp3).
  • providers.xai.speed: provider-native speed override.
  • providers.xiaomi.apiKey: Xiaomi MiMo API key (env: XIAOMI_API_KEY).
  • providers.xiaomi.baseUrl: override the Xiaomi MiMo API base URL (default https://api.xiaomimimo.com/v1, env: XIAOMI_BASE_URL).
  • providers.xiaomi.model: TTS model (default mimo-v2.5-tts, env: XIAOMI_TTS_MODEL; mimo-v2-tts is also supported).
  • providers.xiaomi.voice: MiMo voice id (default mimo_default, env: XIAOMI_TTS_VOICE).
  • providers.xiaomi.format: mp3 or wav (default mp3, env: XIAOMI_TTS_FORMAT).
  • providers.xiaomi.style: optional natural-language style instruction sent as the user message; it is not spoken.
  • providers.openrouter.apiKey: OpenRouter API key (env: OPENROUTER_API_KEY; can reuse models.providers.openrouter.apiKey).
  • providers.openrouter.baseUrl: override the OpenRouter TTS base URL (default https://openrouter.ai/api/v1; legacy https://openrouter.ai/v1 is normalized).
  • providers.openrouter.model: OpenRouter TTS model id (default hexgrad/kokoro-82m; modelId is also accepted).
  • providers.openrouter.voice: provider-specific voice id (default af_alloy; voiceId is also accepted).
  • providers.openrouter.responseFormat: mp3 or pcm (default mp3).
  • providers.openrouter.speed: provider-native speed override.
  • providers.microsoft.enabled: allow Microsoft speech usage (default true; no API key).
  • providers.microsoft.voice: Microsoft neural voice name (e.g. en-US-MichelleNeural).
  • providers.microsoft.lang: language code (e.g. en-US).
  • providers.microsoft.outputFormat: Microsoft output format (e.g. audio-24khz-48kbitrate-mono-mp3).
    • See Microsoft Speech output formats for valid values; not all formats are supported by the bundled Edge-backed transport.
  • providers.microsoft.rate / providers.microsoft.pitch / providers.microsoft.volume: percent strings (e.g. +10%, -5%).
  • providers.microsoft.saveSubtitles: write JSON subtitles alongside the audio file.
  • providers.microsoft.proxy: proxy URL for Microsoft speech requests.
  • providers.microsoft.timeoutMs: request timeout override (ms).
  • edge.*: legacy alias for the same Microsoft settings. Run openclaw doctor --fix to rewrite persisted config to providers.microsoft.

Model-driven overrides (default on)

By default, the model can emit [[tts:...]] directives that override the voice for a single reply, plus an optional [[tts:text]]...[[/tts:text]] block with expressive tags (laughter, singing cues, etc.) that should only appear in the audio. When messages.tts.auto is tagged, these directives are required to trigger audio.

provider=... directives are ignored unless modelOverrides.allowProvider: true.

Example reply payload:

Here you go.

[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
[[tts:text]](laughs) Read the song once more.[[/tts:text]]

Available directive keys (when enabled):

  • provider (registered speech provider id, for example openai, elevenlabs, google, gradium, minimax, microsoft, openrouter, volcengine, vydra, xai, or xiaomi; requires allowProvider: true)
  • voice (OpenAI, Gradium, Volcengine, or Xiaomi voice), voiceName / voice_name / google_voice (Google voice), or voiceId (ElevenLabs / Gradium / MiniMax / xAI)
  • model (OpenAI TTS model, ElevenLabs model id, MiniMax model, or Xiaomi MiMo TTS model) or google_model (Google TTS model)
  • stability, similarityBoost, style, speed, useSpeakerBoost
  • vol / volume (MiniMax volume, 0-10)
  • pitch (MiniMax integer pitch, -12 to 12; fractional values are truncated before the MiniMax request)
  • emotion (Volcengine emotion tag)
  • applyTextNormalization (auto|on|off)
  • languageCode (ISO 639-1)
  • seed

Disable all model overrides:

{
  messages: {
    tts: {
      modelOverrides: {
        enabled: false,
      },
    },
  },
}

Optional allowlist (enable provider switching while keeping other knobs configurable):

{
  messages: {
    tts: {
      modelOverrides: {
        enabled: true,
        allowProvider: true,
        allowSeed: false,
      },
    },
  },
}

Per-user preferences

Slash commands write local overrides to prefsPath (default: ~/.openclaw/settings/tts.json, override with OPENCLAW_TTS_PREFS or messages.tts.prefsPath).

Stored fields:

  • enabled
  • provider
  • maxLength (summary threshold; default 1500 chars)
  • summarize (default true)

These override messages.tts.* for that host.
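
Based on the stored fields above, a prefs file might look like this (a sketch; the exact serialization is owned by OpenClaw and may differ):

{
  "enabled": true,
  "provider": "openai",
  "maxLength": 1500,
  "summarize": true
}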

Output formats (fixed)

  • Feishu / Matrix / Telegram / WhatsApp: voice-note replies prefer Opus (opus_48000_64 from ElevenLabs, opus from OpenAI).
    • 48kHz / 64kbps is a good voice message tradeoff.
  • Feishu: when a voice-note reply is produced as MP3/WAV/M4A or another likely audio file, the Feishu plugin transcodes it to 48kHz Ogg/Opus with ffmpeg before sending the native audio bubble. If conversion fails, Feishu receives the original file as an attachment.
  • Other channels: MP3 (mp3_44100_128 from ElevenLabs, mp3 from OpenAI).
    • 44.1kHz / 128kbps is the default balance for speech clarity.
  • MiniMax: MP3 (speech-2.8-hd model, 32kHz sample rate) for normal audio attachments. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with ffmpeg before delivery.
  • Xiaomi MiMo: MP3 by default, or WAV when configured. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes Xiaomi output to 48kHz Opus with ffmpeg before delivery.
  • Local CLI: uses the configured outputFormat. Voice-note targets are converted to Ogg/Opus and telephony output is converted to raw 16 kHz mono PCM with ffmpeg.
  • Google Gemini: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments, transcodes it to 48kHz Opus for voice-note targets, and returns PCM directly for Talk/telephony.
  • Gradium: WAV for audio attachments, Opus for voice-note targets, and ulaw_8000 at 8 kHz for telephony.
  • Inworld: MP3 for normal audio attachments, native OGG_OPUS for voice-note targets, and raw PCM at 22050 Hz for Talk/telephony.
  • xAI: MP3 by default; responseFormat may be mp3, wav, pcm, mulaw, or alaw. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
  • Microsoft: uses microsoft.outputFormat (default audio-24khz-48kbitrate-mono-mp3).
    • The bundled transport accepts an outputFormat, but not all formats are available from the service.
    • Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
    • Telegram sendVoice accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages.
    • If the configured Microsoft output format fails, OpenClaw retries with MP3.

OpenAI/ElevenLabs output formats are fixed per channel (see above).

Auto-TTS behavior

When enabled, OpenClaw:

  • skips TTS if the reply already contains media or a MEDIA: directive.
  • skips very short replies (< 10 chars).
  • summarizes long replies when enabled, using summaryModel (or agents.defaults.model.primary).
  • attaches the generated audio to the reply.

If the reply exceeds maxLength and summary is off (or no API key for the summary model), audio is skipped and the normal text reply is sent.

Flow diagram

Reply -> TTS enabled?
  no  -> send text
  yes -> has media / MEDIA: / short?
          yes -> send text
          no  -> length > limit?
                   no  -> TTS -> attach audio
                   yes -> summary enabled?
                            no  -> send text
                            yes -> summarize (summaryModel or agents.defaults.model.primary)
                                      -> TTS -> attach audio

Slash command usage

There is a single command: /tts. See Slash commands for enablement details.

Discord note: /tts is a built-in Discord command, so OpenClaw registers /voice as the native command there. Text /tts ... still works.

/tts off
/tts on
/tts status
/tts provider openai
/tts limit 2000
/tts summary off
/tts audio Hello from OpenClaw

Notes:

  • Commands require an authorized sender (allowlist/owner rules still apply).
  • commands.text or native command registration must be enabled.
  • Config messages.tts.auto accepts off|always|inbound|tagged.
  • /tts on writes the local TTS preference to always; /tts off writes it to off.
  • Use config when you want inbound or tagged defaults.
  • limit and summary are stored in local prefs, not the main config.
  • /tts audio generates a one-off audio reply (does not toggle TTS on).
  • /tts status includes fallback visibility for the latest attempt:
    • success fallback: Fallback: <primary> -> <used> plus Attempts: ...
    • failure: Error: ... plus Attempts: ...
    • detailed diagnostics: Attempt details: provider:outcome(reasonCode) latency
  • OpenAI and ElevenLabs API failures include parsed provider error detail and the request id (when the provider returns one), surfaced in TTS errors/logs.

Agent tool

The tts tool converts text to speech and returns an audio attachment for reply delivery. It accepts optional channel and timeoutMs fields; timeoutMs is a per-call provider request timeout in milliseconds. When the channel is Feishu, Matrix, Telegram, or WhatsApp, the audio is delivered as a voice message rather than a file attachment. Feishu can transcode non-Opus TTS output on this path when ffmpeg is available. WhatsApp sends visible text separately from the PTT voice-note audio because clients do not consistently render captions on voice notes.
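
As a sketch, a tool invocation might carry a payload like the following. Only channel and timeoutMs are documented above; the text field name is an assumption for illustration:

{
  // Hypothetical payload shape; only channel and timeoutMs are documented fields.
  "text": "Here is your summary.",
  "channel": "telegram",
  "timeoutMs": 30000
}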

Gateway RPC

Gateway methods:

  • tts.status
  • tts.enable
  • tts.disable
  • tts.convert
  • tts.setProvider
  • tts.providers