ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions

1 Department of Computer Science and Technology, Tsinghua University 2 Beijing National Research Center for Information Science and Technology 3 MoE Key Lab of High Confidence Software Technologies, Peking University

Abstract

With the development of diffusion models, text-to-video generation has recently received significant attention and achieved remarkable success. However, existing text-to-video approaches suffer from the following weaknesses: i) they fail to control the trajectory of the subject or the process of scene transformations; ii) they can only generate videos with a limited number of frames, failing to capture the whole transformation process. To address these issues, we propose ScenarioDiff, a model that is able to generate longer videos with scene transformations. Specifically, we employ a spatial layout fuser to control the positions of subjects and the scene of each frame. To effectively present the process of scene transformation, we introduce a Mixed Frequency ControlNet, which uses several frames of the generated video to extend it chunk by chunk into a long video in an auto-regressive manner. Additionally, to ensure consistency between different video chunks, we propose a cross-chunk scheduling mechanism for inference. Experimental results demonstrate the effectiveness of our approach in generating videos with dynamic scene transformations.
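The spatial layout fuser is only described at a high level here. As a rough illustration, below is a minimal sketch of one common way to inject per-frame layout (bounding boxes plus subject/scene text) into a denoising UNet via gated cross-attention; the module name, dimensions, and design are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a layout-conditioning module (not the paper's exact design):
# per-frame bounding boxes and subject/scene text embeddings become "layout tokens"
# that visual features attend to, steering each frame toward the specified layout.
import torch
import torch.nn as nn

class LayoutFuserSketch(nn.Module):
    def __init__(self, dim=320, box_dim=4, text_dim=768):
        super().__init__()
        self.box_proj = nn.Linear(box_dim, dim)    # embed (x1, y1, x2, y2) boxes
        self.text_proj = nn.Linear(text_dim, dim)  # embed subject/scene text features
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init gate: starts as identity

    def forward(self, visual_tokens, boxes, text_emb):
        # visual_tokens: (B, N, dim); boxes: (B, K, 4); text_emb: (B, K, text_dim)
        layout_tokens = self.box_proj(boxes) + self.text_proj(text_emb)
        fused, _ = self.attn(visual_tokens, layout_tokens, layout_tokens)
        return visual_tokens + torch.tanh(self.gate) * fused
```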

Framework


The framework of ScenarioDiff. Our proposed method contains three main stages. The first stage tunes the UNet on simulated videos so that the model learns how to change scenes. The second stage trains the Mixed Frequency ControlNet, which equips our method with the ability to generate long videos. At inference time, we apply a cross-chunk scheduling mechanism that uses information from the previous chunk to make the video more consistent.
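As a rough illustration of the auto-regressive inference described above, a minimal sketch of the chunk-by-chunk loop follows. Here `sample_chunk`, its arguments, and `num_cond` are hypothetical placeholders standing in for one denoising pass of the base model with the Mixed Frequency ControlNet; they are not taken from the paper's code.

```python
def generate_long_video(sample_chunk, prompt, layouts, num_cond=4):
    """Auto-regressively extend the video chunk by chunk.

    `sample_chunk(prompt, layout, cond_frames)` is a hypothetical stand-in for
    one denoising pass with the Mixed Frequency ControlNet and cross-chunk
    scheduling; `layouts` holds the per-chunk subject positions and scene.
    """
    video, prev_frames = [], None
    for layout in layouts:
        chunk = sample_chunk(prompt, layout, prev_frames)  # list of frames for this chunk
        video.extend(chunk)
        prev_frames = chunk[-num_cond:]  # tail frames condition the next chunk
    return video
```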

Some Cases

A person is riding a horse under the mountain from right to left.
A bird flies around many branches.
Some moose are walking in a meadow under the aurora borealis.
A duck is swimming in the pool and then walks ashore.
A tornado swept through, sending two cars into the sky.
A boy kicks a ball into a pool of water and a dog runs over to it.
A dog is running from grass to beach.
A cloud drifted in from the left and was blown to pieces by the wind after a while.
A cat is running from a rock to street.
A dog is running from the wheat field to a lake.
A tree in the wind.
A clownfish goes up and down in the sea.

Ablations

Qualitative results of ScenarioDiff with and without tuning on the simulated dataset.

(a) without tuning

(b) with tuning

Qualitative results of ScenarioDiff with and without the components of the Mixed Frequency ControlNet (a hedged sketch of the frequency mixing appears after the panels).

(a) without both

(b) without mixed init

(c) without refiner

(d) with all
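The name "Mixed Frequency ControlNet" is not unpacked on this page; one plausible reading is that low-frequency (coarse layout) content is carried over from the conditioning frames of the previous chunk while high-frequency detail comes from the newly generated frames. The sketch below illustrates that reading with an FFT-based band split; the cutoff, the split itself, and the function name are assumptions, not the paper's design.

```python
# Hedged sketch of mixing frequency bands from conditioning frames, loosely
# motivated by the name "Mixed Frequency ControlNet"; an assumption, not the
# paper's implementation.
import torch

def mix_frequency(cond_frames, new_frames, cutoff=0.25):
    # cond_frames, new_frames: (B, C, T, H, W) tensors in pixel or latent space
    cond_f = torch.fft.fftshift(torch.fft.fft2(cond_frames), dim=(-2, -1))
    new_f = torch.fft.fftshift(torch.fft.fft2(new_frames), dim=(-2, -1))
    H, W = cond_frames.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, H, device=cond_frames.device),
        torch.linspace(-1, 1, W, device=cond_frames.device),
        indexing="ij",
    )
    low_pass = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(cond_frames.dtype)
    # low frequencies (coarse layout) from the previous chunk's frames,
    # high frequencies (fine detail) from the freshly generated frames
    mixed_f = cond_f * low_pass + new_f * (1 - low_pass)
    return torch.fft.ifft2(torch.fft.ifftshift(mixed_f, dim=(-2, -1))).real
```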

Qualitative results for different values of \( \hat{\lambda} \) (a hedged sketch of this blending appears after the panels).

(a) \( \hat{\lambda} \) = 0.00

(b) \( \hat{\lambda} \) = 0.25

(c) \( \hat{\lambda} \) = 0.50

(d) \( \hat{\lambda} \) = 0.75

(e) \( \hat{\lambda} \) = 1.00
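\( \hat{\lambda} \) appears to control how strongly the new chunk is tied to the previous one. Below is a minimal sketch of one way such a blend weight could work, mixing latents carried over from the previous chunk with fresh Gaussian noise ("mixed init"); the formula and names are assumptions for illustration, not the paper's exact schedule.

```python
# Hedged sketch: lam_hat stands in for the blend weight written as lambda-hat above.
import torch

def mixed_init(prev_latents, lam_hat):
    # prev_latents: (B, C, T, H, W) latents carried over from the previous chunk
    noise = torch.randn_like(prev_latents)
    # lam_hat = 0.0 -> pure fresh noise (more change, less consistency)
    # lam_hat = 1.0 -> fully reuse the previous chunk (more consistency, less change)
    return lam_hat * prev_latents + (1.0 - lam_hat) * noise
```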

Qualitative results of ScenarioDiff with and without the components of cross-chunk scheduling (a hedged sketch of the noise-reuse component appears after the panels).

(a) without both

(b) without mixed init

(c) without reuse

(d) with all
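For the "reuse" component, one hedged guess is that the noise used for the last frames of the previous chunk is copied into the first frames of the next chunk, so shared content is denoised from the same starting point. The sketch below illustrates that guess; the function name, tensor layout, and `num_reuse` are assumptions, not the paper's code.

```python
# Hedged sketch of carrying noise across chunks for the overlapping frames.
import torch

def reuse_noise(prev_noise, shape, num_reuse=4):
    # prev_noise: (B, C, T, H, W) noise used for the previous chunk
    noise = torch.randn(shape, dtype=prev_noise.dtype, device=prev_noise.device)
    # copy the tail of the previous chunk's noise into the head of the new chunk
    noise[:, :, :num_reuse] = prev_noise[:, :, -num_reuse:]
    return noise
```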