docs(video): describe mode-aware generation capabilities

This commit is contained in:
Vincent Koc
2026-04-06 13:25:45 +01:00
parent 45875ed532
commit d7086526b0
2 changed files with 72 additions and 4 deletions

View File

@@ -592,9 +592,20 @@ API key auth, and dynamic model resolution.
   id: "acme-ai",
   label: "Acme Video",
   capabilities: {
-    maxVideos: 1,
-    maxDurationSeconds: 10,
-    supportsResolution: true,
+    generate: {
+      maxVideos: 1,
+      maxDurationSeconds: 10,
+      supportsResolution: true,
+    },
+    imageToVideo: {
+      enabled: true,
+      maxVideos: 1,
+      maxInputImages: 1,
+      maxDurationSeconds: 5,
+    },
+    videoToVideo: {
+      enabled: false,
+    },
   },
   generateVideo: async (req) => ({ videos: [] }),
 });
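An explicit `videoToVideo: { enabled: false }` block lets the runtime reject unsupported requests before submission. A minimal sketch of such a check, assuming hypothetical type and function names (this is not the actual OpenClaw plugin API):

```typescript
// Sketch of a pre-submission mode check; names are illustrative.
type VideoMode = "generate" | "imageToVideo" | "videoToVideo";

interface ModeCaps {
  enabled?: boolean;
}

type VideoCapabilities = Partial<Record<VideoMode, ModeCaps>>;

function isModeEnabled(caps: VideoCapabilities, mode: VideoMode): boolean {
  const block = caps[mode];
  // Policy assumed here: only an explicit `enabled: false` is a hard
  // rejection; a missing block or missing flag is treated as enabled.
  return block?.enabled !== false;
}

const caps: VideoCapabilities = {
  generate: {},
  imageToVideo: { enabled: true },
  videoToVideo: { enabled: false },
};

console.log(isModeEnabled(caps, "imageToVideo")); // true
console.log(isModeEnabled(caps, "videoToVideo")); // false
```

The treatment of a missing mode block is an assumption; a provider could equally be required to opt in to transform modes.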
@@ -631,6 +642,12 @@ API key auth, and dynamic model resolution.
recommended pattern for company plugins (one plugin per vendor). See
[Internals: Capability Ownership](/plugins/architecture#capability-ownership-model).
For video generation, prefer the mode-aware capability shape shown above:
`generate`, `imageToVideo`, and `videoToVideo`. The older flat fields such
as `maxInputImages`, `maxInputVideos`, and `maxDurationSeconds` still work
as aggregate fallback caps, but they cannot describe per-mode limits or
disabled transform modes as cleanly.
</Step>
<Step title="Test">

View File

@@ -15,6 +15,15 @@ OpenClaw agents can generate videos from text prompts, reference images, or exis
The `video_generate` tool only appears when at least one video-generation provider is available. If you do not see it in your agent tools, set a provider API key or configure `agents.defaults.videoGenerationModel`.
</Note>
OpenClaw treats video generation as three runtime modes:
- `generate` for text-to-video requests with no reference media
- `imageToVideo` when the request includes one or more reference images
- `videoToVideo` when the request includes one or more reference videos
Providers can support any subset of those modes. The tool validates the active
mode before submission and reports supported modes in `action=list`.
## Quick start
1. Set an API key for any supported provider:
@@ -67,7 +76,8 @@ Outside of session-backed agent runs (for example, direct tool invocations), the
Some providers accept additional or alternate API key env vars. See individual [provider pages](#related) for details.
-Run `video_generate action=list` to inspect available providers and models at runtime.
+Run `video_generate action=list` to inspect available providers, models, and
+supported runtime modes.
## Tool parameters
@@ -107,6 +117,15 @@ Run `video_generate action=list` to inspect available providers and models at ru
Not all providers support all parameters. Unsupported overrides are ignored on a best-effort basis and reported as warnings in the tool result. Hard capability limits (such as too many reference inputs) fail before submission.
Reference inputs also select the runtime mode:
- No reference media: `generate`
- Any image reference: `imageToVideo`
- Any video reference: `videoToVideo`
Mixing image and video references in one request is not a stable capability
surface across providers. Prefer one reference type per request.
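The mapping above can be sketched as a small selector function. This is a hypothetical illustration (the request shape is not the actual tool parameter schema), and the precedence for mixed inputs is an assumption, since the docs recommend one reference type per request:

```typescript
// Sketch of the reference-to-mode mapping; request shape is hypothetical.
type VideoMode = "generate" | "imageToVideo" | "videoToVideo";

interface VideoRequest {
  prompt: string;
  imageRefs?: string[];
  videoRefs?: string[];
}

function selectMode(req: VideoRequest): VideoMode {
  // Any video reference selects videoToVideo; otherwise any image
  // reference selects imageToVideo; no reference media means generate.
  // (Precedence for mixed inputs is an assumption, not documented behavior.)
  if (req.videoRefs?.length) return "videoToVideo";
  if (req.imageRefs?.length) return "imageToVideo";
  return "generate";
}

console.log(selectMode({ prompt: "a drone shot of a coastline" }));          // "generate"
console.log(selectMode({ prompt: "animate this", imageRefs: ["ref.png"] })); // "imageToVideo"
console.log(selectMode({ prompt: "extend this", videoRefs: ["clip.mp4"] })); // "videoToVideo"
```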
## Actions
- **generate** (default) -- create a video from the given prompt and optional reference inputs.
@@ -154,6 +173,38 @@ If a provider fails, the next candidate is tried automatically. If all candidate
| Vydra | Uses `https://www.vydra.ai/api/v1` directly to avoid auth-dropping redirects. `veo3` is bundled as text-to-video only; `kling` requires a remote image URL. |
| xAI | Supports text-to-video, image-to-video, and remote video edit/extend flows. |
## Provider capability modes
The shared video-generation contract now lets providers declare mode-specific
capabilities instead of only flat aggregate limits. New provider
implementations should prefer explicit mode blocks:
```typescript
capabilities: {
generate: {
maxVideos: 1,
maxDurationSeconds: 10,
supportsResolution: true,
},
imageToVideo: {
enabled: true,
maxVideos: 1,
maxInputImages: 1,
maxDurationSeconds: 5,
},
videoToVideo: {
enabled: true,
maxVideos: 1,
maxInputVideos: 1,
maxDurationSeconds: 5,
},
}
```
Legacy flat fields such as `maxInputImages` and `maxInputVideos` still work as
backward-compatible aggregate caps, but they cannot express per-mode limits as
precisely.
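One way to picture the fallback behavior: resolve a cap from the active mode's block first, then fall back to the legacy flat field. The field names mirror the capability shape above, but the helper itself is a hypothetical sketch, not the actual contract implementation:

```typescript
// Sketch of per-mode cap resolution with legacy fallback; the helper is
// illustrative, not the actual OpenClaw implementation.
type VideoMode = "generate" | "imageToVideo" | "videoToVideo";

interface ModeCaps {
  enabled?: boolean;
  maxVideos?: number;
  maxDurationSeconds?: number;
}

interface VideoCapabilities extends Partial<Record<VideoMode, ModeCaps>> {
  maxDurationSeconds?: number; // legacy flat aggregate cap
}

function maxDurationFor(
  caps: VideoCapabilities,
  mode: VideoMode,
): number | undefined {
  // Prefer the per-mode limit; fall back to the legacy aggregate cap.
  return caps[mode]?.maxDurationSeconds ?? caps.maxDurationSeconds;
}

const caps: VideoCapabilities = {
  maxDurationSeconds: 10, // legacy cap, used where no mode block applies
  imageToVideo: { enabled: true, maxDurationSeconds: 5 },
};

console.log(maxDurationFor(caps, "imageToVideo")); // 5
console.log(maxDurationFor(caps, "generate"));     // 10
```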
## Configuration
Set the default video generation model in your OpenClaw config: