mirror of https://github.com/openclaw/openclaw.git synced 2026-04-29 04:57:09 +02:00

Peter Steinberger 6a67f65568 fix(voice): reuse preflight transcripts across channels

2026-04-26 05:42:04 +01:00

summary: Unified landing page for media generation, understanding, and speech capabilities

read_when:
- Looking for an overview of media capabilities
- Deciding which media provider to configure
- Understanding how async media generation works

title: Media overview

Media Generation and Understanding

OpenClaw generates images, videos, and music, understands inbound media (images, audio, video), and speaks replies aloud with text-to-speech. All media capabilities are tool-driven: the agent decides when to use them based on the conversation, and each tool only appears when at least one backing provider is configured.

Capabilities at a glance

Capability	Tool	Providers	What it does
Image generation	`image_generate`	ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI	Creates or edits images from text prompts or references
Video generation	`video_generate`	Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI	Creates videos from text, images, or existing videos
Music generation	`music_generate`	ComfyUI, Google, MiniMax	Creates music or audio tracks from text prompts
Text-to-speech (TTS)	`tts`	ElevenLabs, Google, Gradium, Local CLI, Microsoft, MiniMax, OpenAI, Vydra, xAI, Xiaomi MiMo	Converts outbound replies to spoken audio
Media understanding	(automatic)	Any vision/audio-capable model provider, plus CLI fallbacks	Summarizes inbound images, audio, and video
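
The rule that a tool only appears when at least one backing provider is configured can be sketched as a lookup over the table above. This is an illustrative sketch, not OpenClaw's actual implementation: `TOOL_PROVIDERS` and `visibleTools` are hypothetical names, and the provider lists are copied from the table.

```typescript
// Hypothetical sketch: which media tools would be exposed to the agent
// for a given set of configured providers. Names are illustrative only.
type ToolName = "image_generate" | "video_generate" | "music_generate" | "tts";

// Provider lists copied from the "Capabilities at a glance" table.
const TOOL_PROVIDERS: Record<ToolName, readonly string[]> = {
  image_generate: ["ComfyUI", "fal", "Google", "MiniMax", "OpenAI", "Vydra", "xAI"],
  video_generate: [
    "Alibaba", "BytePlus", "ComfyUI", "fal", "Google", "MiniMax",
    "OpenAI", "Qwen", "Runway", "Together", "Vydra", "xAI",
  ],
  music_generate: ["ComfyUI", "Google", "MiniMax"],
  tts: [
    "ElevenLabs", "Google", "Gradium", "Local CLI", "Microsoft",
    "MiniMax", "OpenAI", "Vydra", "xAI", "Xiaomi MiMo",
  ],
};

// A tool is visible when at least one of its backing providers is configured.
function visibleTools(configured: readonly string[]): ToolName[] {
  const have = new Set(configured);
  return (Object.keys(TOOL_PROVIDERS) as ToolName[]).filter((tool) =>
    TOOL_PROVIDERS[tool].some((provider) => have.has(provider)),
  );
}
```

Under these assumptions, configuring only Runway would expose `video_generate` alone, while configuring MiniMax would expose all four tools.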

Provider capability matrix

This table shows which providers support which media capabilities across the platform.

Provider	Image	Video	Music	TTS	STT / Transcription	Realtime Voice	Media Understanding
Alibaba		Yes
BytePlus		Yes
ComfyUI	Yes	Yes	Yes
Deepgram					Yes	Yes
ElevenLabs				Yes	Yes
fal	Yes	Yes
Google	Yes	Yes	Yes	Yes		Yes	Yes
Gradium				Yes
Local CLI				Yes
Microsoft				Yes
MiniMax	Yes	Yes	Yes	Yes
Mistral					Yes
OpenAI	Yes	Yes		Yes	Yes	Yes	Yes
Qwen		Yes
Runway		Yes
SenseAudio					Yes
Together		Yes
Vydra	Yes	Yes		Yes
xAI	Yes	Yes		Yes	Yes		Yes
Xiaomi MiMo	Yes			Yes			Yes

Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.
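
When deciding which provider to configure, it can help to read the matrix in the other direction: given a capability, which providers cover it? The sketch below encodes the matrix rows as data. `Capability`, `PROVIDER_MATRIX`, and `providersFor` are hypothetical names chosen for this example and are not part of OpenClaw.

```typescript
// Hypothetical sketch: the provider capability matrix as a lookup table.
type Capability =
  | "image" | "video" | "music" | "tts" | "stt" | "realtime" | "understanding";

// Rows copied from the provider capability matrix above.
const PROVIDER_MATRIX: Record<string, readonly Capability[]> = {
  Alibaba: ["video"],
  BytePlus: ["video"],
  ComfyUI: ["image", "video", "music"],
  Deepgram: ["stt", "realtime"],
  ElevenLabs: ["tts", "stt"],
  fal: ["image", "video"],
  Google: ["image", "video", "music", "tts", "realtime", "understanding"],
  Gradium: ["tts"],
  "Local CLI": ["tts"],
  Microsoft: ["tts"],
  MiniMax: ["image", "video", "music", "tts"],
  Mistral: ["stt"],
  OpenAI: ["image", "video", "tts", "stt", "realtime", "understanding"],
  Qwen: ["video"],
  Runway: ["video"],
  SenseAudio: ["stt"],
  Together: ["video"],
  Vydra: ["image", "video", "tts"],
  xAI: ["image", "video", "tts", "stt", "understanding"],
  "Xiaomi MiMo": ["image", "tts", "understanding"],
};

// Lists every provider whose row marks the given capability as "Yes".
function providersFor(capability: Capability): string[] {
  return Object.keys(PROVIDER_MATRIX).filter((name) =>
    PROVIDER_MATRIX[name].includes(capability),
  );
}
```

For example, `providersFor("stt")` would return the providers with a "Yes" in the STT / Transcription column.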

How async generation works

Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.
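
The submit, track, and wake flow described above can be sketched as a small in-memory ledger. This is a minimal sketch under stated assumptions, not OpenClaw's task ledger: the `TaskLedger` class, the `MediaTask` shape, the `wakeAgent` callback, and the task ID format are all hypothetical.

```typescript
// Hypothetical sketch of the async flow: submit returns a task ID at once,
// the ledger tracks the job, and completion wakes the agent exactly once.
type TaskStatus = "running" | "succeeded";

interface MediaTask {
  id: string;
  tool: "video_generate" | "music_generate";
  channel: string; // channel the finished media should be posted back to
  status: TaskStatus;
  resultUrl?: string;
}

class TaskLedger {
  private tasks = new Map<string, MediaTask>();
  private nextId = 1;
  private wakeAgent: (task: MediaTask) => void;

  constructor(wakeAgent: (task: MediaTask) => void) {
    this.wakeAgent = wakeAgent;
  }

  // Record the job and return its ID immediately; the provider call itself
  // would run in the background and is omitted from this sketch.
  submit(tool: MediaTask["tool"], channel: string): string {
    const id = `task-${this.nextId++}`;
    this.tasks.set(id, { id, tool, channel, status: "running" });
    return id;
  }

  // Called when the provider reports completion. Unknown or already
  // finished tasks are ignored so the agent is not woken twice.
  complete(id: string, resultUrl: string): void {
    const task = this.tasks.get(id);
    if (task === undefined || task.status !== "running") return;
    task.status = "succeeded";
    task.resultUrl = resultUrl;
    this.wakeAgent(task);
  }

  get(id: string): MediaTask | undefined {
    return this.tasks.get(id);
  }
}
```

Keeping the originating channel on the task record is what would let the wake handler post the result back to the right conversation, even if the agent has handled other messages in the meantime.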

Deepgram, ElevenLabs, Mistral, OpenAI, SenseAudio, and xAI can all transcribe inbound audio through the batch `tools.media.audio` path when configured. Channel plugins that preflight a voice note for mention gating or command parsing mark the transcribed attachment on the inbound context, so the shared media-understanding pass reuses that transcript instead of making a second STT call for the same audio. Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT providers, so live phone audio can be forwarded to the selected vendor without waiting for a completed recording.
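
The transcript reuse described above amounts to caching the STT result on the attachment itself. The sketch below illustrates the idea only: `AudioAttachment`, `SttFn`, and `ensureTranscript` are hypothetical names, and the STT call is modeled as synchronous for brevity, whereas a real provider call would be asynchronous.

```typescript
// Hypothetical sketch: reuse a preflight transcript instead of calling STT twice.
interface AudioAttachment {
  id: string;
  transcript?: string; // set by whichever pass transcribes the audio first
}

// Stand-in for a provider STT call (assumed synchronous in this sketch).
type SttFn = (attachment: AudioAttachment) => string;

// Returns the existing transcript when one is marked on the attachment;
// otherwise transcribes once and stores the result for later passes.
function ensureTranscript(attachment: AudioAttachment, stt: SttFn): string {
  const existing = attachment.transcript;
  if (existing !== undefined) return existing;
  const text = stt(attachment);
  attachment.transcript = text;
  return text;
}
```

If both the channel preflight and the media-understanding pass go through a helper like this, the second caller finds the marked transcript and no second STT request is made for the same audio.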

Google maps to OpenClaw's image, video, music, batch TTS, backend realtime voice, and media-understanding surfaces. OpenAI maps to OpenClaw's image, video, batch TTS, batch STT, Voice Call streaming STT, backend realtime voice, and memory embedding surfaces. xAI currently maps to OpenClaw's image, video, search, code-execution, batch TTS, batch STT, and Voice Call streaming STT surfaces. xAI realtime voice exists as an upstream capability, but it will not be registered in OpenClaw until the shared realtime voice contract can represent it.

Quick links

Image Generation -- generating and editing images
Video Generation -- text-to-video, image-to-video, and video-to-video
Music Generation -- creating music and audio tracks
Text-to-Speech -- converting replies to spoken audio
Media Understanding -- understanding inbound images, audio, and video