From d7086526b0889ef07dec29ae2c074bbe01f9ee36 Mon Sep 17 00:00:00 2001
From: Vincent Koc
Date: Mon, 6 Apr 2026 13:25:45 +0100
Subject: [PATCH] docs(video): describe mode-aware generation capabilities

---
 docs/plugins/sdk-provider-plugins.md | 23 ++++++++++--
 docs/tools/video-generation.md       | 53 +++++++++++++++++++++++++++-
 2 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/docs/plugins/sdk-provider-plugins.md b/docs/plugins/sdk-provider-plugins.md
index 8b3e4c937f3..dfeaed8a654 100644
--- a/docs/plugins/sdk-provider-plugins.md
+++ b/docs/plugins/sdk-provider-plugins.md
@@ -592,9 +592,20 @@ API key auth, and dynamic model resolution.
   id: "acme-ai",
   label: "Acme Video",
   capabilities: {
-    maxVideos: 1,
-    maxDurationSeconds: 10,
-    supportsResolution: true,
+    generate: {
+      maxVideos: 1,
+      maxDurationSeconds: 10,
+      supportsResolution: true,
+    },
+    imageToVideo: {
+      enabled: true,
+      maxVideos: 1,
+      maxInputImages: 1,
+      maxDurationSeconds: 5,
+    },
+    videoToVideo: {
+      enabled: false,
+    },
   },
   generateVideo: async (req) => ({ videos: [] }),
   });
@@ -631,6 +642,12 @@ API key auth, and dynamic model resolution.
   recommended pattern for company plugins (one plugin per vendor). See
   [Internals: Capability Ownership](/plugins/architecture#capability-ownership-model).

+  For video generation, prefer the mode-aware capability shape shown above:
+  `generate`, `imageToVideo`, and `videoToVideo`. The older flat fields such
+  as `maxInputImages`, `maxInputVideos`, and `maxDurationSeconds` still work
+  as aggregate fallback caps, but they cannot describe per-mode limits or
+  disabled transform modes as cleanly.
+
diff --git a/docs/tools/video-generation.md b/docs/tools/video-generation.md
index b0707e71ead..04aafead8cc 100644
--- a/docs/tools/video-generation.md
+++ b/docs/tools/video-generation.md
@@ -15,6 +15,15 @@ OpenClaw agents can generate videos from text prompts, reference images, or exis

 The `video_generate` tool only appears when at least one video-generation provider is available. If you do not see it in your agent tools, set a provider API key or configure `agents.defaults.videoGenerationModel`.

+OpenClaw treats video generation as three runtime modes:
+
+- `generate` for text-to-video requests with no reference media
+- `imageToVideo` when the request includes one or more reference images
+- `videoToVideo` when the request includes one or more reference videos
+
+Providers can support any subset of those modes. The tool validates the active
+mode before submission and reports supported modes in `action=list`.
+
 ## Quick start

 1. Set an API key for any supported provider:
@@ -67,7 +76,8 @@ Outside of session-backed agent runs (for example, direct tool invocations), the

 Some providers accept additional or alternate API key env vars. See individual [provider pages](#related) for details.

-Run `video_generate action=list` to inspect available providers and models at runtime.
+Run `video_generate action=list` to inspect available providers, models, and
+runtime modes at runtime.

 ## Tool parameters

@@ -107,6 +117,15 @@ Run `video_generate action=list` to inspect available providers and models at ru

 Not all providers support all parameters. Unsupported overrides are ignored on a best-effort basis and reported as warnings in the tool result. Hard capability limits (such as too many reference inputs) fail before submission.

+Reference inputs also select the runtime mode:
+
+- No reference media: `generate`
+- Any image reference: `imageToVideo`
+- Any video reference: `videoToVideo`
+
+Mixed image and video references are not a stable shared capability surface.
+Prefer one reference type per request.
+
 ## Actions

 - **generate** (default) -- create a video from the given prompt and optional reference inputs.
@@ -154,6 +173,38 @@ If a provider fails, the next candidate is tried automatically. If all candidate
 | Vydra | Uses `https://www.vydra.ai/api/v1` directly to avoid auth-dropping redirects. `veo3` is bundled as text-to-video only; `kling` requires a remote image URL. |
 | xAI | Supports text-to-video, image-to-video, and remote video edit/extend flows. |

+## Provider capability modes
+
+The shared video-generation contract now lets providers declare mode-specific
+capabilities instead of only flat aggregate limits. New provider
+implementations should prefer explicit mode blocks:
+
+```typescript
+capabilities: {
+  generate: {
+    maxVideos: 1,
+    maxDurationSeconds: 10,
+    supportsResolution: true,
+  },
+  imageToVideo: {
+    enabled: true,
+    maxVideos: 1,
+    maxInputImages: 1,
+    maxDurationSeconds: 5,
+  },
+  videoToVideo: {
+    enabled: true,
+    maxVideos: 1,
+    maxInputVideos: 1,
+    maxDurationSeconds: 5,
+  },
+}
+```
+
+Legacy flat fields such as `maxInputImages` and `maxInputVideos` still work as
+backward-compatible aggregate caps, but they cannot express per-mode limits as
+precisely.
+
 ## Configuration

 Set the default video generation model in your OpenClaw config: