docs(video): describe mode-aware generation capabilities

This commit is contained in:
Vincent Koc
2026-04-06 13:25:45 +01:00
parent 45875ed532
commit d7086526b0
2 changed files with 72 additions and 4 deletions

View File

@@ -592,9 +592,20 @@ API key auth, and dynamic model resolution.
   id: "acme-ai",
   label: "Acme Video",
   capabilities: {
-    maxVideos: 1,
-    maxDurationSeconds: 10,
-    supportsResolution: true,
+    generate: {
+      maxVideos: 1,
+      maxDurationSeconds: 10,
+      supportsResolution: true,
+    },
+    imageToVideo: {
+      enabled: true,
+      maxVideos: 1,
+      maxInputImages: 1,
+      maxDurationSeconds: 5,
+    },
+    videoToVideo: {
+      enabled: false,
+    },
   },
   generateVideo: async (req) => ({ videos: [] }),
 });
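An explicit `videoToVideo: { enabled: false }` block lets the runtime reject unsupported requests before submission. A minimal sketch of such a check, assuming hypothetical type and function names (this is not the actual OpenClaw plugin API):

```typescript
// Sketch of a pre-submission mode check; names are illustrative.
type VideoMode = "generate" | "imageToVideo" | "videoToVideo";

interface ModeCaps {
  enabled?: boolean;
}

type VideoCapabilities = Partial<Record<VideoMode, ModeCaps>>;

function isModeEnabled(caps: VideoCapabilities, mode: VideoMode): boolean {
  const block = caps[mode];
  // Policy assumed here: only an explicit `enabled: false` is a hard
  // rejection; a missing block or missing flag is treated as enabled.
  return block?.enabled !== false;
}

const caps: VideoCapabilities = {
  generate: {},
  imageToVideo: { enabled: true },
  videoToVideo: { enabled: false },
};

console.log(isModeEnabled(caps, "imageToVideo")); // true
console.log(isModeEnabled(caps, "videoToVideo")); // false
```

The treatment of a missing mode block is an assumption; a provider could equally be required to opt in to transform modes.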
@@ -631,6 +642,12 @@ API key auth, and dynamic model resolution.
recommended pattern for company plugins (one plugin per vendor). See
[Internals: Capability Ownership](/plugins/architecture#capability-ownership-model).
For video generation, prefer the mode-aware capability shape shown above:
`generate`, `imageToVideo`, and `videoToVideo`. The older flat fields such
as `maxInputImages`, `maxInputVideos`, and `maxDurationSeconds` still work
as aggregate fallback caps, but they cannot describe per-mode limits or
disabled transform modes as cleanly.
</Step>
<Step title="Test">

View File

@@ -15,6 +15,15 @@ OpenClaw agents can generate videos from text prompts, reference images, or exis
The `video_generate` tool only appears when at least one video-generation provider is available. If you do not see it in your agent tools, set a provider API key or configure `agents.defaults.videoGenerationModel`.
</Note>
OpenClaw treats video generation as three runtime modes:
- `generate` for text-to-video requests with no reference media
- `imageToVideo` when the request includes one or more reference images
- `videoToVideo` when the request includes one or more reference videos
Providers can support any subset of those modes. The tool validates the active
mode before submission and reports supported modes in `action=list`.
## Quick start
1. Set an API key for any supported provider:
@@ -67,7 +76,8 @@ Outside of session-backed agent runs (for example, direct tool invocations), the
Some providers accept additional or alternate API key env vars. See individual [provider pages](#related) for details.
-Run `video_generate action=list` to inspect available providers and models at runtime.
+Run `video_generate action=list` to inspect available providers, models, and
+supported runtime modes.
## Tool parameters
@@ -107,6 +117,15 @@ Run `video_generate action=list` to inspect available providers and models at ru
Not all providers support all parameters. Unsupported overrides are ignored on a best-effort basis and reported as warnings in the tool result. Hard capability limits (such as too many reference inputs) fail before submission.
Reference inputs also select the runtime mode:
- No reference media: `generate`
- Any image reference: `imageToVideo`
- Any video reference: `videoToVideo`
Mixing image and video references in one request is not a stable capability
surface across providers. Prefer one reference type per request.
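The mapping above can be sketched as a small selector function. This is a hypothetical illustration (the request shape is not the actual tool parameter schema), and the precedence for mixed inputs is an assumption, since the docs recommend one reference type per request:

```typescript
// Sketch of the reference-to-mode mapping; request shape is hypothetical.
type VideoMode = "generate" | "imageToVideo" | "videoToVideo";

interface VideoRequest {
  prompt: string;
  imageRefs?: string[];
  videoRefs?: string[];
}

function selectMode(req: VideoRequest): VideoMode {
  // Any video reference selects videoToVideo; otherwise any image
  // reference selects imageToVideo; no reference media means generate.
  // (Precedence for mixed inputs is an assumption, not documented behavior.)
  if (req.videoRefs?.length) return "videoToVideo";
  if (req.imageRefs?.length) return "imageToVideo";
  return "generate";
}

console.log(selectMode({ prompt: "a drone shot of a coastline" }));          // "generate"
console.log(selectMode({ prompt: "animate this", imageRefs: ["ref.png"] })); // "imageToVideo"
console.log(selectMode({ prompt: "extend this", videoRefs: ["clip.mp4"] })); // "videoToVideo"
```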
## Actions
- **generate** (default) -- create a video from the given prompt and optional reference inputs.
@@ -154,6 +173,38 @@ If a provider fails, the next candidate is tried automatically. If all candidate
| Vydra | Uses `https://www.vydra.ai/api/v1` directly to avoid auth-dropping redirects. `veo3` is bundled as text-to-video only; `kling` requires a remote image URL. |
| xAI | Supports text-to-video, image-to-video, and remote video edit/extend flows. |
## Provider capability modes
The shared video-generation contract now lets providers declare mode-specific
capabilities instead of only flat aggregate limits. New provider
implementations should prefer explicit mode blocks:
```typescript
capabilities: {
generate: {
maxVideos: 1,
maxDurationSeconds: 10,
supportsResolution: true,
},
imageToVideo: {
enabled: true,
maxVideos: 1,
maxInputImages: 1,
maxDurationSeconds: 5,
},
videoToVideo: {
enabled: true,
maxVideos: 1,
maxInputVideos: 1,
maxDurationSeconds: 5,
},
}
```
Legacy flat fields such as `maxInputImages` and `maxInputVideos` still work as
backward-compatible aggregate caps, but they cannot express per-mode limits as
precisely.
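One way to picture the fallback behavior: resolve a cap from the active mode's block first, then fall back to the legacy flat field. The field names mirror the capability shape above, but the helper itself is a hypothetical sketch, not the actual contract implementation:

```typescript
// Sketch of per-mode cap resolution with legacy fallback; the helper is
// illustrative, not the actual OpenClaw implementation.
type VideoMode = "generate" | "imageToVideo" | "videoToVideo";

interface ModeCaps {
  enabled?: boolean;
  maxVideos?: number;
  maxDurationSeconds?: number;
}

interface VideoCapabilities extends Partial<Record<VideoMode, ModeCaps>> {
  maxDurationSeconds?: number; // legacy flat aggregate cap
}

function maxDurationFor(
  caps: VideoCapabilities,
  mode: VideoMode,
): number | undefined {
  // Prefer the per-mode limit; fall back to the legacy aggregate cap.
  return caps[mode]?.maxDurationSeconds ?? caps.maxDurationSeconds;
}

const caps: VideoCapabilities = {
  maxDurationSeconds: 10, // legacy cap, used where no mode block applies
  imageToVideo: { enabled: true, maxDurationSeconds: 5 },
};

console.log(maxDurationFor(caps, "imageToVideo")); // 5
console.log(maxDurationFor(caps, "generate"));     // 10
```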
## Configuration
Set the default video generation model in your OpenClaw config: