GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning (ECCV 2024)
This repository contains the official implementation of GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning, presented at ECCV 2024.
GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
Xiaojie Li^1,2, Yibo Yang^3, Xiangtai Li^4, Jianlong Wu^1, Yue Yu^2, Bernard Ghanem^3, Min Zhang^1
^1Harbin Institute of Technology (Shenzhen), ^2Peng Cheng Laboratory, ^3King Abdullah University of Science and Technology, ^4Nanyang Technological University
Follow the steps below to set up the environment and install dependencies.
Create a new Conda environment with Python 3.8 and activate it:
conda create --name env_genview python=3.8 -y
conda activate env_genview
You can install PyTorch, torchvision, and other dependencies via pip or Conda. Choose the command based on your preference and GPU compatibility.
# Using pip
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
# Or using conda
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
# Install additional dependencies
pip install timm==0.9.7 open_clip_torch==2.22.0 diffusers==0.21.4 huggingface_hub==0.17.3 transformers==4.33.3
Clone the GenView repository and install the required dependencies using `openmim`:
git clone https://github.com/xiaojieli0903/genview.git
cd genview
pip install -U openmim
mim install -e .
Apply modifications to `open_clip` and `timm`:
sh tools/toolbox_genview/change_openclip_timm.sh
We utilize the pretrained CLIP ViT-H/14 backbone, which serves as the conditional image encoder in Stable unCLIP v2-1, to determine the proportion of foreground content before image generation. This backbone processes a 224 × 224 input and generates 256 tokens, each with a dimension of 1280.
For calculating the PCA features needed for foreground-background separation, we randomly sample 10,000 images from the original dataset. The threshold α in Equation (7) is selected so that foreground tokens account for approximately 40% of the total tokens, providing a clear separation between foreground and background.
We first extract features from 10,000 images using the CLIP ViT-H/14 backbone and then perform PCA analysis. The calculated PCA vectors act as classifiers for distinguishing between foreground and background regions.
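Conceptually, this step fits a single principal direction on the pooled patch-token features and then picks the threshold α at roughly the 60th percentile of the projected scores, so that about 40% of the tokens land on the foreground side. Below is a minimal sketch of the idea (the file name, shapes, and variable names are illustrative, not the script's exact interface):

```python
import numpy as np
from sklearn.decomposition import PCA

# Patch-token features extracted by CLIP ViT-H/14 (256 tokens x 1280 dims per image).
# The file name here is illustrative; use whatever the extraction script saves.
features = np.load("features/clip_vith14_tokens.npy")   # shape: (num_images, 256, 1280)
tokens = features.reshape(-1, features.shape[-1])       # pool tokens: (num_images * 256, 1280)

# Fit a 1-component PCA; its principal direction tends to separate salient
# (foreground) tokens from background tokens in CLIP feature space.
pca = PCA(n_components=1)
scores = pca.fit_transform(tokens).squeeze(-1)          # one scalar score per token

# The sign of a PCA component is arbitrary; flip it if high scores end up on the background.
# Choose the threshold alpha so that roughly 40% of all tokens count as foreground.
alpha = np.quantile(scores, 0.60)
fg_mask = scores > alpha
print(f"alpha = {alpha:.4f}, foreground fraction = {fg_mask.mean():.2%}")
```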
Command to Extract Features and Perform PCA Analysis:
python tools/clip_pca/extract_features_pca.py \
--input-list tools/clip_pca/train_sampled_1000cls_10img.txt \
--num-extract 10000 \
--patch-size 14 \
--num-vis 20 \
--model ViT-H-14 \
--training-data laion2b_s32b_b79k
- `--input-list`: Path to the file containing the list of sampled images (`tools/clip_pca/train_sampled_1000cls_10img.txt`).
- `--num-extract 10000`: Number of images to process.
- `--patch-size 14`: Patch size used by the model.
- `--num-vis 20`: Number of images to visualize.
- `--model ViT-H-14`: CLIP model to use.
- `--training-data laion2b_s32b_b79k`: Pretrained weights for the model.
Outputs:
- Extracted Features: Saved in the `features/` directory.
- PCA Eigenvectors: Saved in the `eigenvectors/` directory.
- Generated Masks, Maps, and Original Images: Saved in the `masks/`, `maps/`, and `original_images/` directories, respectively.
- Threshold for Foreground-Background Separation: During the PCA analysis, a background threshold is also calculated and used for generating masks. This threshold separates foreground from background regions by comparing the PCA-transformed feature values against it. The resulting masks are then used to compute the foreground ratio for each image in the next steps.
To maintain semantic consistency while ensuring diversity, we determine appropriate noise levels for each image using the PCA vectors and the extracted image features.
First, we need to extract features for each image in the ImageNet dataset. This process may take around 4 hours with a batch size of 1024, and the extracted features will require approximately 4GB of storage.
Command to Extract Features:
python tools/clip_pca/extract_features_pca.py \
--input-list data/imagenet/train.txt \
--num-extract -1 \
--patch-size 14 \
--num-vis 20 \
--model ViT-H-14 \
--training-data laion2b_s32b_b79k
- `--input-list`: Path to the file containing the list of all training images.
- `--num-extract -1`: Processes all images in the list (no limit).
- Other parameters are the same as in Step 1.
Using the previously computed PCA vectors and the foreground-background threshold (`fg_thre`), we calculate the foreground ratio (`fg_ratio`) for each image in the dataset. The `fg_ratio` quantifies the proportion of foreground content within each image, which will later guide noise level determination for adaptive view generation.
Command to Calculate `fg_ratio`:
python tools/clip_pca/calculate_fgratio.py \
--input-dir tools/clip_pca/pca_results/ViT-H-14-laion2b_s32b_b79k/ \
--input-list data/imagenet/train.txt \
--output-dir data/imagenet/ \
--fg-thre {computed_threshold}
- `--input-dir`: Path to the directory containing the extracted features and PCA eigenvectors.
- `--input-list`: Path to the file containing the list of all training images.
- `--output-dir`: Directory where the `fg_ratios.txt` file will be saved.
- `--fg-thre {computed_threshold}`: The foreground-background threshold (`fg_thre`) calculated in Step 1 via PCA analysis. This threshold ensures the proper separation of foreground and background regions.
A file named `fg_ratios.txt` will be generated in the specified output directory. This file contains a list of image paths paired with their respective `fg_ratio` values.
Each line of `fg_ratios.txt` is structured as:
<image_path> <fg_ratio>
Example:
data/imagenet/train/img_0001.jpg 0.42
data/imagenet/train/img_0002.jpg 0.38
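For reference, the `fg_ratio` of an image is simply the fraction of its patch tokens whose PCA projection exceeds `fg_thre`. A minimal sketch of that computation (an illustrative helper, not the script's exact interface; the real script may also center the features the same way the PCA was fit):

```python
import numpy as np

def foreground_ratio(token_features, pca_vector, fg_thre):
    """Fraction of patch tokens classified as foreground for a single image.

    token_features: (256, 1280) CLIP ViT-H/14 patch-token features of one image.
    pca_vector:     (1280,) PCA eigenvector from Step 1 (assumed here to be the first component).
    fg_thre:        scalar foreground-background threshold from Step 1.
    """
    scores = token_features @ pca_vector      # project each token onto the PCA direction
    return float((scores > fg_thre).mean())   # e.g. 0.42 -> 42% of tokens are foreground
```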
Finally, we distribute the original `fg_ratios.txt` entries into separate files based on specified ranges and mapping values. Each output file is named after its corresponding mapped noise level value (e.g., `fg_ratios_0.txt`, `fg_ratios_100.txt`, etc.) and contains the image paths and `fg_ratio` values that fall into the respective range.
Command to Generate Noise Level Files:
python tools/clip_pca/generate_ada_noise_level.py \
--input-file data/imagenet/fg_ratios.txt \
--output-dir data/imagenet/
- `--input-file`: Path to the `fg_ratios.txt` file generated in the previous step.
- `--output-dir`: Directory where the noise level files will be saved.
These files categorize images based on their foreground ratios, allowing us to assign appropriate noise levels during image generation to achieve the desired balance between semantic consistency and diversity.
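The underlying mapping is a simple binning of `fg_ratio` values. The sketch below assumes five equal-width bins over [0, 1], each assigned one of the supported noise levels {0, 100, 200, 300, 400}; the actual bin boundaries and the ratio-to-level assignment are defined in `generate_ada_noise_level.py` and may differ:

```python
from collections import defaultdict

# Illustrative mapping only; see generate_ada_noise_level.py for the real ranges.
NOISE_LEVELS = [0, 100, 200, 300, 400]

def assign_noise_level(fg_ratio):
    """Map a foreground ratio in [0, 1] to one of the supported noise levels."""
    bin_idx = min(int(fg_ratio * len(NOISE_LEVELS)), len(NOISE_LEVELS) - 1)
    return NOISE_LEVELS[bin_idx]

buckets = defaultdict(list)
with open("data/imagenet/fg_ratios.txt") as f:
    for line in f:
        path, ratio = line.split()
        buckets[assign_noise_level(float(ratio))].append(line)

for level, lines in buckets.items():
    with open(f"data/imagenet/fg_ratios_{level}.txt", "w") as out:
        out.writelines(lines)
```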
In this step, we generate image variations for the dataset by applying the calculated noise levels. This ensures the generated data maintains semantic consistency while introducing controlled diversity for adaptive view generation.
For each noise level file (e.g., `fg_ratios_*.txt`), use the following command to generate image variations:
python tools/toolbox_genview/generate_image_variations_noiselevels.py \
--input-list data/imagenet/fg_ratios_{noise_level}.txt \
--output-prefix data/imagenet/train_variations/ \
--noise-level {noise_level}
- `--input-list`: Path to the text file that contains image paths and `fg_ratio` values.
- `--output-prefix`: Prefix for the output directory where the variations will be saved.
- `--noise-level`: Noise level to apply to the image variations (options: 0, 100, 200, 300, 400).
Repeat this command for all `fg_ratios_*.txt` files to generate the complete set of image variations.
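Under the hood, each variation comes from the Stable unCLIP v2-1 image-variation pipeline, with the bucket's noise level applied to the CLIP image embedding. A minimal single-image sketch using `diffusers` (the repository script additionally handles file lists, batching, and output naming; the paths here are placeholders):

```python
import torch
from PIL import Image
from diffusers import StableUnCLIPImg2ImgPipeline

# Stable unCLIP v2-1: generates image variations conditioned on a CLIP image embedding.
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to("cuda")

image = Image.open("data/imagenet/train/img_0001.jpg").convert("RGB")

# noise_level perturbs the CLIP image embedding before generation:
# 0 stays close to the source image; larger values yield more diverse variations.
variation = pipe(image, noise_level=100).images[0]
variation.save("data/imagenet/train_variations/img_0001.jpg")
```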
Use the following shell script to parallelize the image generation process. This script splits the input list into multiple parts and processes them in parallel.
bash tools/toolbox_genview/run_parallel_levels.sh data/imagenet/fg_ratios_{noise_level}.txt data/imagenet/train_variations/ {noise_level}
To simplify the data preparation process, pre-generated image variations are available for download at https://huggingface.co/datasets/Xiaojie0903/Genview_syntheric_dataset_in1k and can be used directly for your experiments and model training.
- Download and Merge the Dataset:

  After downloading all parts of the compressed dataset, merge them into a single file and extract the contents:

  cd /path/to/download_tars/
  cat train_variations.tar.* > train_variations.tar
  tar -xvf train_variations.tar
- Create Symbolic Links:

  To simplify access to the extracted data, create symbolic links in the `genview` project directory:

  cd genview
  mkdir -p data/imagenet
  cd data/imagenet
  ln -s /path/to/imagenet/train .
  ln -s /path/to/imagenet/val .
  ln -s /path/to/download_tars/train_variations/ .

  - `train/`: Link to the original ImageNet training data.
  - `val/`: Link to the ImageNet validation data.
  - `train_variations/`: Link to the directory containing the pre-generated image variations.
Once the image variations are prepared, generate a list of all synthetic images for further training and evaluation:
python tools/toolbox_genview/generate_train_variations_list.py \
--input-dir data/imagenet/train_variations \
--output-list data/imagenet/train_variations.txt
- `--input-dir`: Path to the directory containing the generated image variations.
- `--output-list`: Path to save the generated image list.
Outputs:
- Image Variations: Saved in the `train_variations/` directory, with noise applied according to the `fg_ratios_*.txt` files.
- Synthetic Image List: A text file (`train_variations.txt`) containing paths to all generated image variations, saved in `data/imagenet/`.
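The synthetic image list is just one generated image path per line. A simple equivalent sketch (assuming the variations are stored as JPEG/PNG files; the repository script may apply additional filtering or ordering):

```python
import os

root = "data/imagenet/train_variations"
with open("data/imagenet/train_variations.txt", "w") as out:
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith((".jpg", ".jpeg", ".png")):
                out.write(os.path.join(dirpath, name) + "\n")
```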
By completing this step, you will have a comprehensive dataset containing controlled image variations ready for self-supervised training with enhanced view quality.
We use the pretrained CLIP ConvNeXt-Base model as the encoder to extract feature maps from the augmented positive views. These feature maps, with a resolution of 7 × 7 from a 224 × 224 input, are used to calculate foreground and background attention maps based on PCA.
We randomly sample 10,000 images to compute the PCA features. The threshold α ensures that approximately 40% of the tokens represent the foreground, enabling clear separation.
Use the following command to extract features and compute PCA:
python tools/clip_pca/extract_features_pca.py \
--input-list tools/clip_pca/train_sampled_1000cls_10img.txt \
--num-extract 10000 \
--patch-size 32 \
--num-vis 20 \
--model convnext_base_w \
--training-data laion2b-s13b-b82k-augreg
- Extracted Features: Stored in `features/`.
- PCA Eigenvectors: Stored in `eigenvectors/`.
- Masks, Maps, and Original Images: Stored in `masks/`, `maps/`, and `original_images/`.
These PCA vectors are used to generate foreground and background attention maps during pretraining. We provide precomputed PCA vectors at `tools/clip_pca/pca_results/convnext_base_w_laion2b-s13k-b82k-augreg/eigenvectors/pca_vectors.npy`.
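At pretraining time, these PCA vectors are projected against the 7 × 7 feature map of each augmented view to obtain foreground and background attention maps. A rough sketch of the idea (using the first component and a sigmoid normalization is an assumption for illustration; the exact computation lives in the GenView model code):

```python
import numpy as np
import torch

# Precomputed PCA direction(s) for CLIP ConvNeXt-Base features (path from above).
pca = np.load("tools/clip_pca/pca_results/convnext_base_w_laion2b-s13k-b82k-augreg/"
              "eigenvectors/pca_vectors.npy")
pca_vector = torch.from_numpy(pca[0] if pca.ndim == 2 else pca).float()  # first principal direction

def attention_maps(feature_map, pca_vector):
    """feature_map: (B, C, 7, 7) CLIP ConvNeXt-Base features of the augmented views."""
    scores = torch.einsum("bchw,c->bhw", feature_map, pca_vector)  # project onto the PCA direction
    fg = torch.sigmoid(scores)        # soft foreground attention in [0, 1]
    bg = 1.0 - fg                     # complementary background attention
    return fg, bg
```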
Detailed commands for running pretraining and downstream tasks with single or multiple machines/GPUs:
Training with Multiple GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 PORT=29500 bash tools/dist_train.sh ${CONFIG_FILE} 8 [PY_ARGS] [--resume /path/to/latest/epoch_{number}.pth]
Training with Multiple Machines
CPUS_PER_TASK=8 GPUS_PER_NODE=8 GPUS=16 sh tools/slurm_train.sh $PARTITION $JOBNAME ${CONFIG_FILE} $WORK_DIR [--resume /path/to/latest/epoch_{number}.pth]
Be sure to replace `$PARTITION`, `$JOBNAME`, and `$WORK_DIR` with the actual values for your setup.
The following experiments provide various pretraining setups using different architectures, epochs, and GPU configurations.
SimSiam + ResNet50 + 200 Epochs + 8 GPUs
- Pretraining:
CPUS_PER_TASK=8 GPUS_PER_NODE=8 GPUS=8 sh tools/slurm_train.sh $PARTITION simsiam_pretrain configs_genview/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k_singleview_clipmask.py work_dirs/simsiam_resnet50_8xb32-coslr-200e_in1k_singleview_clipmask
- Linear Probe:
CPUS_PER_TASK=8 GPUS_PER_NODE=8 GPUS=8 sh tools/slurm_train.sh $PARTITION simsiam_linear configs_genview/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k_clip.py work_dirs/simsiam_resnet50_8xb32-coslr-200e_in1k_diffssl_prob1_128w_clipmask/linear --cfg-options model.backbone.init_cfg.checkpoint=work_dirs/simsiam_resnet50_8xb32-coslr-200e_in1k_diffssl_prob1_128w_clipmask/epoch_200.pth
MoCo v3 + ResNet50 + 100 Epochs + 8 GPUs
- Pretraining:
CPUS_PER_TASK=8 GPUS_PER_NODE=8 GPUS=8 sh tools/slurm_train.sh $PARTITION mocov3r50_pretrain configs_genview/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_singleview_clipmask.py work_dirs/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_singleview_clipmask
- Linear Probe:
CPUS_PER_TASK=8 GPUS_PER_NODE=8 GPUS=8 sh tools/slurm_train.sh $PARTITION mocov3r50_linear configs_genview/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k_clip.py work_dirs/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_singleview_clipmask/linear --cfg-options model.backbone.init_cfg.checkpoint=work_dirs/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_singleview_clipmask/epoch_100.pth
MoCo v3 + ViT-B + 300 Epochs + 16 GPUs
- Pretraining:
CPUS_PER_TASK=8 GPUS_PER_NODE=8 GPUS=16 sh tools/slurm_train.sh $PARTITION mocov3vit_pretrain configs_genview/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k_singleview_clipmask.py work_dirs/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k_singleview_clipmask
- Linear Probe:
CPUS_PER_TASK=8 GPUS_PER_NODE=8 GPUS=8 sh tools/slurm_train.sh $PARTITION mocov3vit_linear configs_genview/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k_clip.py work_dirs/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k_singleview_clipmask/linear --cfg-options model.backbone.init_cfg.checkpoint=work_dirs/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k_singleview_clipmask/epoch_300.pth
We have uploaded the pre-trained models to https://huggingface.co/Xiaojie0903/genview_pretrained_models. Access them directly using the links below:
| Method | Backbone | Pretraining Epochs | Linear Probe Accuracy (%) | Model Link |
|---|---|---|---|---|
| MoCo v2 + GenView | ResNet-50 | 200 | 70.0 | Download |
| SwAV + GenView | ResNet-50 | 200 | 71.7 | Download |
| SimSiam + GenView | ResNet-50 | 200 | 72.2 | Download |
| BYOL + GenView | ResNet-50 | 200 | 73.2 | Download |
| MoCo v3 + GenView | ResNet-50 | 100 | 72.7 | Download |
| MoCo v3 + GenView | ResNet-50 | 300 | 74.8 | Download |
| MoCo v3 + GenView | ViT-S | 300 | 74.5 | Download |
| MoCo v3 + GenView | ViT-B | 300 | 77.8 | Download |
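If you want to inspect a downloaded checkpoint before pointing the linear-probe configs at it, something like the following works (assuming the standard mmpretrain checkpoint layout with a `state_dict` entry; the file name is a placeholder):

```python
import torch

ckpt = torch.load("mocov3_resnet50_genview_300e.pth", map_location="cpu")  # placeholder file name
state_dict = ckpt.get("state_dict", ckpt)          # mmpretrain checkpoints store weights under 'state_dict'
backbone = {k: v for k, v in state_dict.items() if k.startswith("backbone.")}
print(f"{len(backbone)} backbone tensors")
print(list(backbone)[:3])                          # peek at a few parameter names
```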
If you find the repo useful for your research, please consider citing our paper:
@inproceedings{li2024genview,
  title={GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning},
  author={Li, Xiaojie and Yang, Yibo and Li, Xiangtai and Wu, Jianlong and Yu, Yue and Ghanem, Bernard and Zhang, Min},
  booktitle={Proceedings of the European Conference on Computer Vision},
  pages={306--325},
  year={2024},
  publisher={Springer}
}
This codebase is built on mmpretrain. We thank its contributors for the great work.