
Seedance 2.0 Multimodal Input System

Seedance 2.0 (Dreamina)

Seedance 2.0 is the only video generation model that accepts all four input modalities simultaneously: text, images, video clips, and audio. This section breaks down each input type, the @tag reference control system, output specifications, and the built-in editing tools.

Input modalities

Text Prompts: Descriptive prompts guiding the scene, action, style, and composition of the output video. The foundation of every generation request.
Images (up to 9): Reference images for characters, backgrounds, style guides, or any visual material you want the model to incorporate into the output.
Video Clips (up to 3): Reference clips totaling up to 15 seconds. Useful for motion references, style templates, or continuation footage.
Audio Files (up to 3): Audio files totaling up to 15 seconds. The model synchronizes the generated video to the audio, including lip sync for speech.

Combine up to 12 reference files across all modalities (images, video clips, and audio files) in a single generation request.
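
As a quick pre-flight check, the limits above can be encoded in a small validation helper. The sketch below is illustrative Python, not part of any official Seedance SDK; the data structure and function names are assumptions.

```python
# Hypothetical pre-flight check for the reference limits described above.
# The ReferenceBundle structure is illustrative, not an official API object.
from dataclasses import dataclass, field

@dataclass
class ReferenceBundle:
    images: list[str] = field(default_factory=list)         # file paths, max 9
    video_clips: list[float] = field(default_factory=list)  # clip durations in seconds, max 3 totaling 15 s
    audio_files: list[float] = field(default_factory=list)  # audio durations in seconds, max 3 totaling 15 s

def validate(bundle: ReferenceBundle) -> list[str]:
    """Return a list of limit violations; an empty list means the bundle is acceptable."""
    errors = []
    if len(bundle.images) > 9:
        errors.append("too many images (max 9)")
    if len(bundle.video_clips) > 3 or sum(bundle.video_clips) > 15:
        errors.append("video clips exceed 3 files or 15 seconds total")
    if len(bundle.audio_files) > 3 or sum(bundle.audio_files) > 15:
        errors.append("audio files exceed 3 files or 15 seconds total")
    total = len(bundle.images) + len(bundle.video_clips) + len(bundle.audio_files)
    if total > 12:
        errors.append(f"{total} reference files exceed the 12-file cap")
    return errors
```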

The @tag reference control system

Assign each reference file a tag (e.g., @speaker, @background, @style) and reference those tags in your text prompt. Example: upload a photo tagged @speaker, audio tagged @dialogue, and a landscape tagged @setting, then prompt: "@speaker stands in @setting and delivers @dialogue with dramatic lighting." The model uses each reference for its designated purpose rather than blending them indiscriminately.
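
To make the tagging workflow concrete, here is a hypothetical request payload in Python that pairs each @tag in the prompt with a reference file and verifies the two sets match. The field names and structure are illustrative assumptions, not the actual Seedance 2.0 API schema.

```python
# Illustrative only: a hypothetical payload showing how @tags might map
# reference files to roles in the prompt. Field names are assumptions.
request = {
    "prompt": "@speaker stands in @setting and delivers @dialogue with dramatic lighting.",
    "references": [
        {"tag": "speaker",  "type": "image", "file": "portraits/host.png"},
        {"tag": "setting",  "type": "image", "file": "locations/cliffside.jpg"},
        {"tag": "dialogue", "type": "audio", "file": "audio/intro_line.wav"},
    ],
}

# Each @tag used in the prompt should have exactly one matching reference entry.
tags_in_prompt = {word.strip(".,") for word in request["prompt"].split() if word.startswith("@")}
tags_in_refs = {"@" + ref["tag"] for ref in request["references"]}
assert tags_in_prompt == tags_in_refs, f"unmatched tags: {tags_in_prompt ^ tags_in_refs}"
```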

Output specifications

Resolution: Up to 2K
Duration: 4 to 15 seconds per clip
Frame rate: 24 FPS
Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4
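
These specs translate naturally into a handful of generation settings. The snippet below is a hypothetical configuration check; the parameter names are made up for illustration and are not official API fields.

```python
# Hypothetical generation settings reflecting the output specs above.
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}

settings = {
    "resolution": "2048x1152",  # up to 2K
    "duration_s": 8,            # must fall within 4-15 seconds
    "fps": 24,                  # fixed at 24 FPS
    "aspect_ratio": "16:9",
}

assert settings["aspect_ratio"] in ALLOWED_ASPECT_RATIOS
assert 4 <= settings["duration_s"] <= 15
```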

Editing tools

Extend: Lengthen an existing clip beyond its original duration.
Merge: Combine multiple generated clips into a seamless sequence.
Restyle: Apply a new visual style or aesthetic to existing footage.
Character Swap: Replace a character in a generated video with a different reference.