Searching Priors Makes Text-to-Video Synthesis Better

Haoran Cheng1 Liang Peng3 Linxuan Xia1 Yuepeng Hu4
Hengjia Li1 Qinglin Lu5 Xiaofei He1 Boxi Wu2*
(* Corresponding author)
1 State Key Lab of CAD&CG, Zhejiang University 2 College of Software, Zhejiang University 3 FABU Inc. 4 Ningbo Port 5 Tencent Data Platform
Bees buzzing around blooming flowers.
A dog eating popcorn in a movie theater.
Two people are playing guitar.
Water pouring into a glass.
Waves crashing against the rocks.
A rainbow flag waving in the morning breeze.

Abstract

Significant advancements in video diffusion models have brought substantial progress to the field of text-to-video (T2V) synthesis. However, existing T2V synthesis models struggle to accurately generate complex motion dynamics, leading to a reduction in video realism. One possible solution is to collect massive amounts of data and train the model on it, but this would be extremely expensive. To alleviate this problem, in this paper, we reformulate the typical T2V generation process as a search-based generation pipeline. Instead of scaling up the model training, we employ existing videos as the motion prior database. Specifically, we divide the T2V generation process into two steps: (i) For a given prompt input, we search existing text-video datasets to find videos whose text labels closely match the motion described in the prompt. We propose a tailored search algorithm that emphasizes object motion features. (ii) The retrieved videos are processed and distilled into motion priors, which are used to fine-tune a pre-trained base T2V model; the desired video is then generated from the input prompt. By utilizing the priors gleaned from the searched videos, we enhance the motion realism of the generated videos. All operations can be finished on a single NVIDIA RTX 4090 GPU. We validate our method against state-of-the-art T2V models across diverse prompt inputs. The code will be made public.


Overview


Pipeline Overview. The pipeline searches for videos with similar text labels and extracts relevant information to fine-tune a pre-trained VDM for video generation. Given an input text prompt and a large-scale text-video dataset, our goal is to generate a high-quality video by leveraging the rich information in the dataset. To achieve this, we propose a search-based pipeline, which consists of two main steps:
  1. Video Retrieval: For a given prompt input, we first vectorize the prompt text into semantic vectors. We then run a matching process that uses these vectors to select the text-video pair in the dataset whose motion semantics are most similar to the input prompt. The video of the selected pair serves as the reference video (see the retrieval sketch after this list).
  2. Tuning and Synthesis: Once the retrieved video is obtained, we perform motion extraction to capture the most representative visual information from the reference video and distill it into priors that fine-tune a pre-trained video synthesis model (a tuning sketch also follows this list). Finally, the input prompt is fed into the tuned model to generate the final result.
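To make the retrieval step concrete, below is a minimal sketch of the matching process. It embeds captions with an off-the-shelf sentence encoder; the sentence-transformers model, the function retrieve_reference, and the example captions are our illustrative assumptions, and the sketch omits the motion-aware weighting of our tailored search algorithm.

```python
# Minimal retrieval sketch (assumption: a generic sentence encoder stands in
# for the tailored, motion-aware search algorithm described above).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def retrieve_reference(prompt: str, captions: list[str]) -> int:
    """Return the index of the dataset caption that best matches the prompt."""
    # Vectorize the prompt and every dataset caption (L2-normalized).
    vecs = encoder.encode([prompt] + captions, normalize_embeddings=True)
    query, keys = vecs[0], vecs[1:]
    # With normalized embeddings, cosine similarity is a dot product.
    scores = keys @ query
    return int(np.argmax(scores))

# Usage: the video paired with the best-matching caption becomes the reference.
captions = ["a tiger eating meat", "a train carriage at a station"]
best = retrieve_reference("A tiger is eating grass.", captions)
```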
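The tuning step can be sketched in the same spirit. The loop below is a hedged illustration, not the exact implementation: it assumes a hypothetical pre-trained video diffusion model vdm that predicts added noise, exposes an add_noise scheduler hook, and names its temporal layers with the substring "temporal". Only those motion-related parameters are updated, which is what keeps one-shot tuning light enough for a single RTX 4090.

```python
# Hedged sketch of tuning a pre-trained VDM on a distilled motion prior.
# Assumptions (illustrative, not the paper's API): `vdm` predicts the added
# noise, provides `add_noise(latents, noise, t)`, and marks its temporal
# (motion) parameters with the substring "temporal" in their names.
import torch
import torch.nn.functional as F

def finetune_on_prior(vdm, prior_latents, text_emb, steps=300, lr=1e-5):
    # Freeze the whole model, then unfreeze only the temporal layers.
    for p in vdm.parameters():
        p.requires_grad_(False)
    temporal = [p for n, p in vdm.named_parameters() if "temporal" in n]
    for p in temporal:
        p.requires_grad_(True)

    opt = torch.optim.AdamW(temporal, lr=lr)
    for _ in range(steps):
        # Standard diffusion objective on latents of the distilled keyframes.
        t = torch.randint(0, 1000, (prior_latents.shape[0],),
                          device=prior_latents.device)
        noise = torch.randn_like(prior_latents)
        noisy = vdm.add_noise(prior_latents, noise, t)
        loss = F.mse_loss(vdm(noisy, t, text_emb), noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
```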

Comparison with SOTA

We present qualitative comparisons of our method against SOTA baselines, including VideoLDM, Make-A-Video, and PYoCo. Since these works have not released their models or code, we use the authors' publicly released samples for comparison.

Prompt: A dog swimming
Video LDM
Ours

Prompt: Fireworks
Video LDM
Ours

Prompt: A knight riding on a horse through the countryside
Make-A-Video
Ours

Prompt: Clown fish swimming through the coral reef
Make-A-Video
Ours

Prompt: A video of milk pouring over raspberry and blackberries.
PYoCo
Ours

Prompt: A cute rabbit is eating grass, wildlife photography.
PYoCo
Ours


The following are comparisons of our method against CogVideo, Show-1, ZeroScope_v2, and AnimateDiff-Lightning. All results are generated with the officially released models.

Prompt: A happy dog running in a park.

Prompt: Woman running on the beach at sunrise.

Ablation Analysis



We present the ablation analysis of the video retrieval module and the motion extraction algorithm below.
Prompt: A tiger is eating grass.
Baseline + no prior
Baseline + random prior
Baseline + retrieved prior + no keyframe extraction
Baseline + retrieved prior + keyframe extraction

We perform several ablation experiments on different modules of the proposed pipeline. The generated results are presented above, from which we observe the following:
  1. Without adding any prior to the initial diffusion baseline, the video is almost static, and the tiger does not exhibit the eating action.
  2. When we add a prior to the model but remove the video retrieval module, the algorithm randomly chooses a video from the database as the reference (e.g., the static train carriage scene shown). Even after distilling the dynamic information from this video and adding it as a prior, the model still fails to generate the eating motion.
  3. When we add a prior and enable the tailored video matching algorithm, we obtain videos related to the desired action (e.g., the tiger eating meat shown). However, without the keyframe extraction algorithm, the selected keyframes contain a lot of irrelevant visual content (e.g., the tiger obscured by leaves in the last three frames), which still does not help the model generate the desired motion (a minimal keyframe-selection sketch follows this list).
  4. Finally, when we enable all modules, we obtain appropriate video keyframes. After adding this prior to the model, the generated tiger finally begins to chew the grass.
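Since keyframe extraction drives the difference between cases 3 and 4 above, here is a minimal stand-in built on OpenCV frame differencing; the function name and the scoring heuristic are our assumptions rather than the exact extraction algorithm. It keeps the k frames with the largest inter-frame change, a rough proxy for motion-relevant content.

```python
# Hedged keyframe-selection sketch (assumption: frame-difference scoring
# approximates the motion-aware extraction used in the pipeline).
import cv2
import numpy as np

def extract_keyframes(path: str, k: int = 8) -> list:
    """Keep the k frames with the largest inter-frame change (motion proxy)."""
    cap = cv2.VideoCapture(path)
    frames, scores, prev = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            # Mean absolute difference against the previous frame.
            scores.append(float(np.mean(cv2.absdiff(gray, prev))))
            frames.append(frame)
        prev = gray
    cap.release()
    top = np.argsort(scores)[-k:]
    return [frames[i] for i in sorted(top)]  # preserve temporal order
```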



We present the ablation analysis of different dataset sizes below.
Prompt: A Lamborghini is speeding around dreamy clouds.
Results based on 5%~25% of the original dataset size.
Results based on 50%~100% of the original dataset size.

Prompt: An elephant is walking under the sea.
Results based on 25% of the original dataset size.
Results based on 50% of the original dataset size.

The search results and generation results are visualized above. From these videos, it can be seen that when the data scale is small, the generator's performance is negatively affected. For example, the left video of the Lamborghini case does not capture the car's driving dynamics well; the left video of the elephant case confuses the elephant's limbs.

More Generated Samples



A cute girl looks at the beautiful nature through the window.
Waterfalls falling down the cliff.
In the afterglow of the sunset, the river flows towards the distance.
Spaceman riding motorcycle with galaxy in background.
Coins falling into a piggy bank.
Bubbles rising in a glass of soda.
Hot air balloons rising over the mountains.
Lightning flashing across the stormy sky.
"Raindrops falling on a window.