From d7086526b0889ef07dec29ae2c074bbe01f9ee36 Mon Sep 17 00:00:00 2001
From: Vincent Koc <vincentkoc@ieee.org>
Date: Mon, 6 Apr 2026 13:25:45 +0100
Subject: [PATCH] docs(video): describe mode-aware generation capabilities

---
 docs/plugins/sdk-provider-plugins.md | 23 ++++++++++--
 docs/tools/video-generation.md       | 53 +++++++++++++++++++++++++++-
 2 files changed, 72 insertions(+), 4 deletions(-)
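
The mode rules documented in docs/tools/video-generation.md below can be
sketched roughly as follows. This is a minimal illustration, not the shipped
implementation: `resolveMode`, `checkRequest`, and the type names are
hypothetical. Two behaviours are assumptions of the sketch: a missing mode
block or `enabled: false` is treated as "unsupported", and mixed image and
video references return `null`.

```typescript
// Hypothetical types mirroring the capability shape shown in the docs.
type VideoMode = "generate" | "imageToVideo" | "videoToVideo";

interface ModeCapabilities {
  enabled?: boolean;
  maxVideos?: number;
  maxInputImages?: number;
  maxInputVideos?: number;
  maxDurationSeconds?: number;
  supportsResolution?: boolean;
}

type VideoCapabilities = Partial<Record<VideoMode, ModeCapabilities>>;

interface ReferenceInputs {
  images: string[];
  videos: string[];
}

// Reference inputs select the runtime mode. Mixed inputs return null,
// because the docs do not define a stable mode for them (sketch assumption).
function resolveMode(refs: ReferenceInputs): VideoMode | null {
  const hasImages = refs.images.length > 0;
  const hasVideos = refs.videos.length > 0;
  if (hasImages && hasVideos) return null;
  if (hasVideos) return "videoToVideo";
  if (hasImages) return "imageToVideo";
  return "generate";
}

// Returns a list of problems; an empty list means the request passes
// the per-mode checks modelled here.
function checkRequest(caps: VideoCapabilities, refs: ReferenceInputs): string[] {
  const mode = resolveMode(refs);
  if (mode === null) {
    return ["mixed image and video references are outside this sketch"];
  }
  const modeCaps = caps[mode];
  // Assumption: no block, or enabled: false, means the mode is unsupported.
  if (modeCaps === undefined || modeCaps.enabled === false) {
    return [`mode ${mode} is not supported`];
  }
  const problems: string[] = [];
  if (modeCaps.maxInputImages !== undefined && refs.images.length > modeCaps.maxInputImages) {
    problems.push(`too many reference images (max ${modeCaps.maxInputImages})`);
  }
  if (modeCaps.maxInputVideos !== undefined && refs.videos.length > modeCaps.maxInputVideos) {
    problems.push(`too many reference videos (max ${modeCaps.maxInputVideos})`);
  }
  return problems;
}
```

With the Acme example from sdk-provider-plugins.md (`videoToVideo: { enabled:
false }`), a request carrying a reference video would be rejected by this
sketch before submission, while a single reference image would pass.
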
diff --git a/docs/plugins/sdk-provider-plugins.md b/docs/plugins/sdk-provider-plugins.md
index 8b3e4c937f3..dfeaed8a654 100644
--- a/docs/plugins/sdk-provider-plugins.md
+++ b/docs/plugins/sdk-provider-plugins.md
@@ -592,9 +592,20 @@ API key auth, and dynamic model resolution.
         id: "acme-ai",
         label: "Acme Video",
         capabilities: {
-          maxVideos: 1,
-          maxDurationSeconds: 10,
-          supportsResolution: true,
+          generate: {
+            maxVideos: 1,
+            maxDurationSeconds: 10,
+            supportsResolution: true,
+          },
+          imageToVideo: {
+            enabled: true,
+            maxVideos: 1,
+            maxInputImages: 1,
+            maxDurationSeconds: 5,
+          },
+          videoToVideo: {
+            enabled: false,
+          },
         },
         generateVideo: async (req) => ({ videos: [] }),
       });
@@ -631,6 +642,12 @@ API key auth, and dynamic model resolution.
     recommended pattern for company plugins (one plugin per vendor). See
     [Internals: Capability Ownership](/plugins/architecture#capability-ownership-model).
 
+    For video generation, prefer the mode-aware capability shape shown above:
+    `generate`, `imageToVideo`, and `videoToVideo`. The older flat fields such
+    as `maxInputImages`, `maxInputVideos`, and `maxDurationSeconds` still work
+    as aggregate fallback caps, but they cannot describe per-mode limits or
+    disabled transform modes as cleanly.
+
   </Step>
 
   <Step title="Test">
diff --git a/docs/tools/video-generation.md b/docs/tools/video-generation.md
index b0707e71ead..04aafead8cc 100644
--- a/docs/tools/video-generation.md
+++ b/docs/tools/video-generation.md
@@ -15,6 +15,15 @@ OpenClaw agents can generate videos from text prompts, reference images, or exis
 The `video_generate` tool only appears when at least one video-generation provider is available. If you do not see it in your agent tools, set a provider API key or configure `agents.defaults.videoGenerationModel`.
 </Note>
 
+OpenClaw treats video generation as three runtime modes:
+
+- `generate` for text-to-video requests with no reference media
+- `imageToVideo` when the request includes one or more reference images
+- `videoToVideo` when the request includes one or more reference videos
+
+Providers can support any subset of those modes. The tool validates the active
+mode before submission and reports supported modes in `action=list`.
+
 ## Quick start
 
 1. Set an API key for any supported provider:
@@ -67,7 +76,8 @@ Outside of session-backed agent runs (for example, direct tool invocations), the
 
 Some providers accept additional or alternate API key env vars. See individual [provider pages](#related) for details.
 
-Run `video_generate action=list` to inspect available providers and models at runtime.
+Run `video_generate action=list` to inspect available providers, models, and
+supported modes at runtime.
 
 ## Tool parameters
 
@@ -107,6 +117,15 @@ Run `video_generate action=list` to inspect available providers and models at ru
 
 Not all providers support all parameters. Unsupported overrides are ignored on a best-effort basis and reported as warnings in the tool result. Hard capability limits (such as too many reference inputs) fail before submission.
 
+Reference inputs also select the runtime mode:
+
+- No reference media: `generate`
+- Any image reference: `imageToVideo`
+- Any video reference: `videoToVideo`
+
+Mixed image and video references are not part of the stable shared capability
+surface. Prefer one reference type per request.
+
 ## Actions
 
 - **generate** (default) -- create a video from the given prompt and optional reference inputs.
@@ -154,6 +173,38 @@ If a provider fails, the next candidate is tried automatically. If all candidate
 | Vydra    | Uses `https://www.vydra.ai/api/v1` directly to avoid auth-dropping redirects. `veo3` is bundled as text-to-video only; `kling` requires a remote image URL. |
 | xAI      | Supports text-to-video, image-to-video, and remote video edit/extend flows.                                                                                 |
 
+## Provider capability modes
+
+The shared video-generation contract now lets providers declare mode-specific
+capabilities instead of only flat aggregate limits. New provider
+implementations should prefer explicit mode blocks:
+
+```typescript
+capabilities: {
+  generate: {
+    maxVideos: 1,
+    maxDurationSeconds: 10,
+    supportsResolution: true,
+  },
+  imageToVideo: {
+    enabled: true,
+    maxVideos: 1,
+    maxInputImages: 1,
+    maxDurationSeconds: 5,
+  },
+  videoToVideo: {
+    enabled: true,
+    maxVideos: 1,
+    maxInputVideos: 1,
+    maxDurationSeconds: 5,
+  },
+}
+```
+
+Legacy flat fields such as `maxInputImages` and `maxInputVideos` still work as
+backward-compatible aggregate caps, but they cannot express per-mode limits as
+precisely.
+
 ## Configuration
 
 Set the default video generation model in your OpenClaw config: