PyTorch implementation of TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models
Despite remarkable achievements in video synthesis, fine-grained control over complex dynamics, such as nuanced movement among multiple interacting objects, remains a significant hurdle for dynamic world modeling. This is compounded by the need to handle object appearance and disappearance, drastic scale changes, and instance consistency across frames. These challenges hinder the development of video generation that faithfully mimics real-world complexity, limiting its utility for applications that require high-level realism and controllability, including advanced scene simulation and training of perception systems. To address this, we propose TrackDiffusion, a novel video generation framework affording fine-grained, trajectory-conditioned motion control via diffusion models, which enables precise manipulation of object trajectories and interactions and overcomes the prevalent problems of scale and continuity disruptions. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects, a critical factor overlooked in the current literature. Moreover, we demonstrate that video sequences generated by TrackDiffusion can be used as training data for visual perception models. To the best of our knowledge, this is the first work to apply video diffusion models with tracklet conditions and to demonstrate that generated frames can improve the performance of object trackers.
The framework generates video frames conditioned on the provided tracklets and employs the Instance Enhancer to reinforce the temporal consistency of foreground instances. A new gated cross-attention layer is inserted to take in the instance information.
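Below is a minimal PyTorch sketch of what tracklet-conditioned gated cross-attention could look like, purely for illustration: the module name, the instance-embedding construction, and the tanh gate are assumptions here, not the exact layer shipped in this repository.

```python
import torch
import torch.nn as nn

class GatedInstanceCrossAttention(nn.Module):
    """Hypothetical sketch: cross-attend video tokens to per-frame instance embeddings,
    then blend the result back in through a learnable, zero-initialized gate."""

    def __init__(self, dim: int, inst_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj_inst = nn.Linear(inst_dim, dim)  # project instance (box + identity) embeddings
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate so the pretrained backbone is unchanged at the start of training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, inst_emb: torch.Tensor) -> torch.Tensor:
        # x:        (B, N, dim)      latent tokens of one frame
        # inst_emb: (B, M, inst_dim) embeddings of the M tracked instances in that frame
        ctx = self.proj_inst(inst_emb)
        attn_out, _ = self.attn(self.norm(x), ctx, ctx)
        return x + torch.tanh(self.gate) * attn_out


# Toy usage: batch of 2 frames, 77 latent tokens, 5 tracked instances per frame.
layer = GatedInstanceCrossAttention(dim=320, inst_dim=768)
x = torch.randn(2, 77, 320)
inst = torch.randn(2, 5, 768)
out = layer(x, inst)  # same shape as x
```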
The code is tested with PyTorch==2.0.1 and CUDA 11.8 on A800 servers. To set up the Python environment, run:
cd ${ROOT}
pip install -r requirements.txt
Then install the third-party requirements:
pip install https://download.openmmlab.com/mmcv/dist/cu117/torch2.0.0/mmcv-2.0.0-cp310-cp310-manylinux1_x86_64.whl
cd ${ROOT}/third_party
git clone https://github.com/open-mmlab/mmtracking.git -b dev-1.x
cd mmtracking
pip install -e .
cd ../diffusers
pip install -e .
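As an optional sanity check, the key packages should now import without errors (a simple sketch; the printed versions may differ from the tested setup):

```python
# Quick check that the core dependencies are importable after installation.
import torch
import mmcv
import mmtrack   # installed from the mmtracking dev-1.x branch
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmcv:", mmcv.__version__)
print("mmtrack:", mmtrack.__version__)
print("diffusers:", diffusers.__version__)
```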
Please download the datasets from their official websites: YouTube-VIS 2021.
We also provide text caption files for the YTVIS dataset; please download them from Google Drive.
| T2V Version | Stable Video Diffusion Version |
|---|---|
| Our training is based on pengxiang/GLIGEN_1_4. You can access the following link to obtain the trained weights: weight | Our training is based on stabilityai/stable-video-diffusion-img2vid. You can access the following links to obtain weights for stage 1 and 2: Stage1, Stage2 |
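If you prefer to fetch the base checkpoints ahead of time, `huggingface_hub` can download them locally, for example (a hedged sketch; the `./pretrained/...` paths are arbitrary choices, not paths required by the training scripts):

```python
from huggingface_hub import snapshot_download

# Base model for the T2V version (fine-tuned from GLIGEN).
snapshot_download(repo_id="pengxiang/GLIGEN_1_4",
                  local_dir="./pretrained/GLIGEN_1_4")

# Base model for the SVD version (stage 1/2 fine-tuning starts from this checkpoint).
snapshot_download(repo_id="stabilityai/stable-video-diffusion-img2vid",
                  local_dir="./pretrained/stable-video-diffusion-img2vid")
```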
We use CocoVID to maintain all datasets in this codebase, so you need to convert the official annotations to this format. We provide a conversion script; usage is as follows:
cd ./third_party/mmtracking
python ./tools/dataset_converters/youtubevis/youtubevis2coco.py -i ./data/youtube_vis_2021 -o ./data/youtube_vis_2021/annotations --version 2021
The folder structure will look as follows after you run the script:
├── data
│ ├── youtube_vis_2021
│ │ │── train
│ │ │ │── JPEGImages
│ │ │ │── instances.json (the official annotation files)
│ │ │ │── ......
│ │ │── valid
│ │ │ │── JPEGImages
│ │ │ │── instances.json (the official annotation files)
│ │ │ │── ......
│ │ │── test
│ │ │ │── JPEGImages
│ │ │ │── instances.json (the official annotation files)
│ │ │ │── ......
│ │ │── annotations (the converted annotation file)
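To sanity-check the converted CocoVID annotations, you can load the JSON and count its entries (a minimal sketch assuming the usual CocoVID keys `videos`, `images`, `annotations`, and `categories`; the exact output filename may differ):

```python
import json

# Path produced by the conversion script above (exact filename may differ).
ann_file = "./data/youtube_vis_2021/annotations/youtube_vis_2021_train.json"

with open(ann_file) as f:
    ann = json.load(f)

print("videos:     ", len(ann.get("videos", [])))
print("images:     ", len(ann.get("images", [])))
print("annotations:", len(ann.get("annotations", [])))
print("categories: ", len(ann.get("categories", [])))

# Each image entry carries a video_id and frame_id, and each annotation an instance_id,
# which is what the tracklet conditioning is built from.
```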
Launch training (on 8×A800) with the command below. If you encounter an error similar to `AssertionError: MMEngine==0.10.3 is used but incompatible. Please install mmengine>=0.0.0, <0.2.0.`, locate that version assertion and comment it out (see the sketch after the command).
bash ./scripts/t2v.sh
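For reference, the version check mentioned above typically follows the standard OpenMMLab pattern sketched below; the actual assertion lives in the installed package (e.g. `mmtrack/__init__.py`), and that is where it should be commented out:

```python
# Illustrative OpenMMLab-style version guard (not verbatim from this repo);
# comment out the equivalent assertion in the installed package if it fires.
# assert (mmengine_version >= digit_version(mmengine_minimum_version)
#         and mmengine_version < digit_version(mmengine_maximum_version)), \
#     f'MMEngine=={mmengine.__version__} is used but incompatible. ' \
#     f'Please install mmengine>={mmengine_minimum_version}, ' \
#     f'<{mmengine_maximum_version}.'
```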
✨ If you want the SVD version, please see the code on the SVD branch.
Check demo.ipynb for more details.
- Comparison of TrackDiffusion with other methods on generation quality:
- Training perception models with frames generated by TrackDiffusion:
More results can be found in the main paper.
We aim to construct a controllable and flexible pipeline for perception data corner case generation and visual world modeling! Check our latest works:
- GeoDiffusion: text-prompted geometric controls for 2D object detection.
- MagicDrive: multi-view street scene generation for 3D object detection.
- TrackDiffusion: multi-object video generation for multi-object tracking (MOT).
- DetDiffusion: customized corner case generation.
- Geom-Erasing: geometric controls for implicit concept removal.
@misc{li2024trackdiffusion,
title={TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models},
author={Pengxiang Li and Kai Chen and Zhili Liu and Ruiyuan Gao and Lanqing Hong and Guo Zhou and Hua Yao and Dit-Yan Yeung and Huchuan Lu and Xu Jia},
year={2024},
eprint={2312.00651},
archivePrefix={arXiv},
primaryClass={cs.CV}
}