openclaw/docs/tools/video-generation.md at 7928da0f486dcb351376bc044f84b8da679df0bc

Mirrors/openclaw

Fork 0

mirror of https://github.com/openclaw/openclaw.git synced 2026-04-12 04:14:55 +02:00

Files

Peter Steinberger a463a33eee feat: preserve media intent across provider fallback

2026-04-06 23:23:06 +01:00

17 KiB

Raw Blame History

summary, read_when, title

summary

read_when

title

Generate videos from text, images, or existing videos using 12 provider backends

Generating videos via the agent

Configuring video generation providers and models

Understanding the video_generate tool parameters

Video Generation

OpenClaw agents can generate videos from text prompts, reference images, or existing videos. Twelve provider backends are supported, each with different model options, input modes, and feature sets. The agent picks the right provider automatically based on your configuration and available API keys.

The `video_generate` tool only appears when at least one video-generation provider is available. If you do not see it in your agent tools, set a provider API key or configure `agents.defaults.videoGenerationModel`.

OpenClaw treats video generation as three runtime modes:

generate for text-to-video requests with no reference media
imageToVideo when the request includes one or more reference images
videoToVideo when the request includes one or more reference videos

Providers can support any subset of those modes. The tool validates the active mode before submission and reports supported modes in action=list.

Quick start

Set an API key for any supported provider:

export GEMINI_API_KEY="your-key"

Optionally pin a default model:

openclaw config set agents.defaults.videoGenerationModel.primary "google/veo-3.1-fast-generate-preview"

Ask the agent:

Generate a 5-second cinematic video of a friendly lobster surfing at sunset.

The agent calls video_generate automatically. No tool allowlisting is needed.

What happens when you generate a video

Video generation is asynchronous. When the agent calls video_generate in a session:

OpenClaw submits the request to the provider and immediately returns a task ID.
The provider processes the job in the background (typically 30 seconds to 5 minutes depending on the provider and resolution).
When the video is ready, OpenClaw wakes the same session with an internal completion event.
The agent posts the finished video back into the original conversation.

While a job is in flight, duplicate video_generate calls in the same session return the current task status instead of starting another generation. Use openclaw tasks list or openclaw tasks show <taskId> to check progress from the CLI.

Outside of session-backed agent runs (for example, direct tool invocations), the tool falls back to inline generation and returns the final media path in the same turn.

Task lifecycle

Each video_generate request moves through four states:

queued -- task created, waiting for the provider to accept it.
running -- provider is processing (typically 30 seconds to 5 minutes depending on provider and resolution).
succeeded -- video ready; the agent wakes and posts it to the conversation.
failed -- provider error or timeout; the agent wakes with error details.

Check status from the CLI:

openclaw tasks list
openclaw tasks show <taskId>
openclaw tasks cancel <taskId>

Duplicate prevention: if a video task is already queued or running for the current session, video_generate returns the existing task status instead of starting a new one. Use action: "status" to check explicitly without triggering a new generation.

Supported providers

Provider	Default model	Text	Image ref	Video ref	API key
Alibaba	`wan2.6-t2v`	Yes	Yes (remote URL)	Yes (remote URL)	`MODELSTUDIO_API_KEY`
BytePlus	`seedance-1-0-lite-t2v-250428`	Yes	1 image	No	`BYTEPLUS_API_KEY`
ComfyUI	`workflow`	Yes	1 image	No	`COMFY_API_KEY` or `COMFY_CLOUD_API_KEY`
fal	`fal-ai/minimax/video-01-live`	Yes	1 image	No	`FAL_KEY`
Google	`veo-3.1-fast-generate-preview`	Yes	1 image	1 video	`GEMINI_API_KEY`
MiniMax	`MiniMax-Hailuo-2.3`	Yes	1 image	No	`MINIMAX_API_KEY`
OpenAI	`sora-2`	Yes	1 image	1 video	`OPENAI_API_KEY`
Qwen	`wan2.6-t2v`	Yes	Yes (remote URL)	Yes (remote URL)	`QWEN_API_KEY`
Runway	`gen4.5`	Yes	1 image	1 video	`RUNWAYML_API_SECRET`
Together	`Wan-AI/Wan2.2-T2V-A14B`	Yes	1 image	No	`TOGETHER_API_KEY`
Vydra	`veo3`	Yes	1 image (`kling`)	No	`VYDRA_API_KEY`
xAI	`grok-imagine-video`	Yes	1 image	1 video	`XAI_API_KEY`

Some providers accept additional or alternate API key env vars. See individual provider pages for details.

Run video_generate action=list to inspect available providers, models, and runtime modes at runtime.

Declared capability matrix

This is the explicit mode contract used by video_generate, contract tests, and the shared live sweep.

Provider	`generate`	`imageToVideo`	`videoToVideo`	Shared live lanes today
Alibaba	Yes	Yes	Yes	`generate`, `imageToVideo`; `videoToVideo` skipped because this provider needs remote `http(s)` video URLs
BytePlus	Yes	Yes	No	`generate`, `imageToVideo`
ComfyUI	Yes	Yes	No	Not in the shared sweep; workflow-specific coverage lives with Comfy tests
fal	Yes	Yes	No	`generate`, `imageToVideo`
Google	Yes	Yes	Yes	`generate`, `imageToVideo`; shared `videoToVideo` skipped because the current buffer-backed Gemini/Veo sweep does not accept that input
MiniMax	Yes	Yes	No	`generate`, `imageToVideo`
OpenAI	Yes	Yes	Yes	`generate`, `imageToVideo`; shared `videoToVideo` skipped because this org/input path currently needs provider-side inpaint/remix access
Qwen	Yes	Yes	Yes	`generate`, `imageToVideo`; `videoToVideo` skipped because this provider needs remote `http(s)` video URLs
Runway	Yes	Yes	Yes	`generate`, `imageToVideo`; `videoToVideo` runs only when the selected model is `runway/gen4_aleph`
Together	Yes	Yes	No	`generate`, `imageToVideo`
Vydra	Yes	Yes	No	`generate`; shared `imageToVideo` skipped because bundled `veo3` is text-only and bundled `kling` requires a remote image URL
xAI	Yes	Yes	Yes	`generate`, `imageToVideo`; `videoToVideo` skipped because this provider currently needs a remote MP4 URL

Tool parameters

Required

Parameter	Type	Description
`prompt`	string	Text description of the video to generate (required for `action: "generate"`)

Content inputs

Parameter	Type	Description
`image`	string	Single reference image (path or URL)
`images`	string[]	Multiple reference images (up to 5)
`video`	string	Single reference video (path or URL)
`videos`	string[]	Multiple reference videos (up to 4)

Style controls

Parameter	Type	Description
`aspectRatio`	string	`1:1`, `2:3`, `3:2`, `3:4`, `4:3`, `4:5`, `5:4`, `9:16`, `16:9`, `21:9`
`resolution`	string	`480P`, `720P`, `768P`, or `1080P`
`durationSeconds`	number	Target duration in seconds (rounded to nearest provider-supported value)
`size`	string	Size hint when the provider supports it
`audio`	boolean	Enable generated audio when supported
`watermark`	boolean	Toggle provider watermarking when supported

Advanced

Parameter	Type	Description
`action`	string	`"generate"` (default), `"status"`, or `"list"`
`model`	string	Provider/model override (e.g. `runway/gen4.5`)
`filename`	string	Output filename hint

Not all providers support all parameters. OpenClaw already normalizes duration to the closest provider-supported value, and it also remaps translated geometry hints such as size-to-aspect-ratio when a fallback provider exposes a different control surface. Truly unsupported overrides are ignored on a best-effort basis and reported as warnings in the tool result. Hard capability limits (such as too many reference inputs) fail before submission.

Reference inputs also select the runtime mode:

No reference media: generate
Any image reference: imageToVideo
Any video reference: videoToVideo

Mixed image and video references are not a stable shared capability surface. Prefer one reference type per request.

Actions

generate (default) -- create a video from the given prompt and optional reference inputs.
status -- check the state of the in-flight video task for the current session without starting another generation.
list -- show available providers, models, and their capabilities.

Model selection

When generating a video, OpenClaw resolves the model in this order:

model tool parameter -- if the agent specifies one in the call.
videoGenerationModel.primary -- from config.
videoGenerationModel.fallbacks -- tried in order.
Auto-detection -- uses providers that have valid auth, starting with the current default provider, then remaining providers in alphabetical order.

If a provider fails, the next candidate is tried automatically. If all candidates fail, the error includes details from each attempt.

Set agents.defaults.mediaGenerationAutoProviderFallback: false if you want video generation to use only the explicit model, primary, and fallbacks entries.

{
  agents: {
    defaults: {
      videoGenerationModel: {
        primary: "google/veo-3.1-fast-generate-preview",
        fallbacks: ["runway/gen4.5", "qwen/wan2.6-t2v"],
      },
    },
  },
}

Provider notes

Provider	Notes
Alibaba	Uses DashScope/Model Studio async endpoint. Reference images and videos must be remote `http(s)` URLs.
BytePlus	Single image reference only.
ComfyUI	Workflow-driven local or cloud execution. Supports text-to-video and image-to-video through the configured graph.
fal	Uses queue-backed flow for long-running jobs. Single image reference only.
Google	Uses Gemini/Veo. Supports one image or one video reference.
MiniMax	Single image reference only.
OpenAI	Only `size` override is forwarded. Other style overrides (`aspectRatio`, `resolution`, `audio`, `watermark`) are ignored with a warning.
Qwen	Same DashScope backend as Alibaba. Reference inputs must be remote `http(s)` URLs; local files are rejected upfront.
Runway	Supports local files via data URIs. Video-to-video requires `runway/gen4_aleph`. Text-only runs expose `16:9` and `9:16` aspect ratios.
Together	Single image reference only.
Vydra	Uses `https://www.vydra.ai/api/v1` directly to avoid auth-dropping redirects. `veo3` is bundled as text-to-video only; `kling` requires a remote image URL.
xAI	Supports text-to-video, image-to-video, and remote video edit/extend flows.

Provider capability modes

The shared video-generation contract now lets providers declare mode-specific capabilities instead of only flat aggregate limits. New provider implementations should prefer explicit mode blocks:

capabilities: {
  generate: {
    maxVideos: 1,
    maxDurationSeconds: 10,
    supportsResolution: true,
  },
  imageToVideo: {
    enabled: true,
    maxVideos: 1,
    maxInputImages: 1,
    maxDurationSeconds: 5,
  },
  videoToVideo: {
    enabled: true,
    maxVideos: 1,
    maxInputVideos: 1,
    maxDurationSeconds: 5,
  },
}

Flat aggregate fields such as maxInputImages and maxInputVideos are not enough to advertise transform-mode support. Providers should declare generate, imageToVideo, and videoToVideo explicitly so live tests, contract tests, and the shared video_generate tool can validate mode support deterministically.

Live tests

Opt-in live coverage for the shared bundled providers:

OPENCLAW_LIVE_TEST=1 pnpm test:live -- extensions/video-generation-providers.live.test.ts

Repo wrapper:

pnpm test:live:media video

This live file loads missing provider env vars from ~/.profile, prefers live/env API keys ahead of stored auth profiles by default, and runs the declared modes it can exercise safely with local media:

generate for every provider in the sweep
imageToVideo when capabilities.imageToVideo.enabled
videoToVideo when capabilities.videoToVideo.enabled and the provider/model accepts buffer-backed local video input in the shared sweep

Today the shared videoToVideo live lane covers:

runway only when you select runway/gen4_aleph

Configuration

Set the default video generation model in your OpenClaw config:

{
  agents: {
    defaults: {
      videoGenerationModel: {
        primary: "qwen/wan2.6-t2v",
        fallbacks: ["qwen/wan2.6-r2v-flash"],
      },
    },
  },
}

Or via the CLI:

openclaw config set agents.defaults.videoGenerationModel.primary "qwen/wan2.6-t2v"

Tools Overview
Background Tasks -- task tracking for async video generation
Alibaba Model Studio
BytePlus
ComfyUI
fal
Google (Gemini)
MiniMax
OpenAI
Qwen
Runway
Together AI
Vydra
xAI
Configuration Reference
Models

17 KiB Raw Blame History