CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion

Dianbing Xi¹,²,*, Jiepeng Wang²,*,‡, Yuanzhi Liang², Xi Qiu², Jialun Liu², Hao Pan³, Yuchi Huo¹, Rui Wang¹,†, Haibin Huang², Chi Zhang², Xuelong Li²,†

*Equal contribution. †Corresponding author. ‡Project leader.

¹State Key Laboratory of CAD&CG, Zhejiang University
²Institute of Artificial Intelligence, China Telecom (TeleAI)
³Tsinghua University

📄 Paper · 🌐 Project Page

📌 Intro

CtrlVDiff unifies forward and inverse video generation within a single model, enabling the extraction of all modalities in a single pass. It provides layer-wise control over appearance and structure, facilitating applications such as material editing and object insertion.

✨ Highlights

Unified Video Framework: A single model supports both forward and inverse video generation. It can function as a renderer to synthesize videos, and as a decomposer to extract all multimodal representations in just one forward pass.
Layer-wise Control Strategy: To enable a unified model to flexibly handle arbitrary combinations and numbers of input modalities, we introduce a Hybrid Modality Control Strategy (HMCS), which provides hierarchical control over video generation across geometry, appearance, structure, and semantics.
MMVideo Dataset: To support the scale and diversity required for this task, we construct the MMVideo dataset, which includes both real-world and synthetic scenes. It contains 350k video clips paired with rich multimodal annotations, enabling high-quality video generation and decomposition across diverse domains.
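
To make the "arbitrary combinations of input modalities" idea concrete, here is a minimal sketch of how a set of provided modalities could be turned into a condition mask and a list of modalities left to generate. It is not taken from the CtrlVDiff code: the modality names and the `plan_pass` helper are hypothetical, chosen only to illustrate the renderer and decomposer uses described above.

```python
from typing import List, Tuple

# Hypothetical modality names, chosen only for illustration; the README
# does not list the exact set used by CtrlVDiff.
MODALITIES: List[str] = ["rgb", "depth", "normal", "albedo", "segmentation", "edges"]


def plan_pass(provided: List[str]) -> Tuple[List[int], List[str]]:
    """Return a 0/1 condition mask over MODALITIES and the list of
    modalities that are left for the model to generate.

    This is an assumed bookkeeping scheme, not the authors' HMCS.
    """
    unknown = set(provided) - set(MODALITIES)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    mask = [1 if name in provided else 0 for name in MODALITIES]
    targets = [name for name in MODALITIES if name not in provided]
    return mask, targets


# Inverse use ("decomposer"): only the video is given, everything else is predicted.
inverse_mask, inverse_targets = plan_pass(["rgb"])

# Forward use ("renderer"): some controls are given, the video is synthesized.
forward_mask, forward_targets = plan_pass(["depth", "albedo"])
```

In this sketch the same function covers both directions: the only thing that changes between rendering and decomposition is which entries of the mask are set.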

📜 Citation

@misc{xdb2025ctrlvdiff,
      title={CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion}, 
      author={Dianbing Xi and Jiepeng Wang and Yuanzhi Liang and Xi Qiu and Jialun Liu and Hao Pan and Yuchi Huo and Rui Wang and Haibin Huang and Chi Zhang and Xuelong Li},
      year={2025},
      eprint={2511.21129},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21129}, 
}
