
Parameter-Inverted Image Pyramid Networks (PIIP)

[Paper] [Chinese Explanation (中文解读)] [Slides] [Video]

The official implementation of the paper "Parameter-Inverted Image Pyramid Networks"

NeurIPS 2024 Spotlight (Top 2.08%)

⭐️ Highlights

TL;DR: We introduce Parameter-Inverted Image Pyramid Networks (PIIP), a parameter-inverted paradigm in which models of different parameter sizes process different resolution levels of the image pyramid, saving computation cost while improving performance.

  • Supports object detection, instance segmentation, semantic segmentation, and image classification.
  • Surpasses single-branch methods with higher performance and lower computation cost.
  • Improves the performance of InternViT-6B on object detection by 2.0% (55.8% $\rm AP^b$) while reducing computation cost by 62% (backbone FLOPs drop from 29323G to 11080G; see the detection table below).


📌 Abstract

Image pyramids are commonly used in modern computer vision tasks to obtain multi-scale features for precise understanding of images. However, image pyramids process multiple resolutions of images using the same large-scale model, which requires significant computational cost. To overcome this issue, we propose a novel network architecture known as the Parameter-Inverted Image Pyramid Networks (PIIP). Our core idea is to use models with different parameter sizes to process different resolution levels of the image pyramid, thereby balancing computational efficiency and performance. Specifically, the input to PIIP is a set of multi-scale images, where higher resolution images are processed by smaller networks. We further propose a feature interaction mechanism to allow features of different resolutions to complement each other and effectively integrate information from different spatial scales. Extensive experiments demonstrate that the PIIP achieves superior performance in tasks such as object detection, segmentation, and image classification, compared to traditional image pyramid methods and single-branch networks, while reducing computational cost. Notably, when applying our method on a large-scale vision foundation model InternViT-6B, we improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation. These results validate the effectiveness of the PIIP approach and provide a new technical direction for future vision computing tasks.
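
The computational saving can be sanity-checked with standard dense-attention ViT FLOPs accounting, roughly $12Nd^2 + 2N^2d$ per layer for $N$ tokens and width $d$. The sketch below is a back-of-the-envelope illustration only: the ViT-T/S/B widths and depths are the usual ones, and its outputs will not match the table numbers, which are measured on the actual models (e.g., with window attention).

```python
def vit_flops(resolution, width, depth, patch=16):
    """Crude ViT FLOPs: 12*N*d^2 (projections + MLP) + 2*N^2*d (attention) per layer."""
    n = (resolution // patch) ** 2  # number of patch tokens
    per_layer = 12 * n * width**2 + 2 * n**2 * width
    return depth * per_layer

# Common ViT-T/S/B shapes as (width, depth) -- assumed here for illustration.
tiny, small, base = (192, 12), (384, 12), (768, 12)

# One large model on the highest resolution...
single_branch = vit_flops(1120, *base)
# ...vs. the parameter-inverted pairing of PIIP-TSB at 1120/896/448:
piip_tsb = vit_flops(1120, *tiny) + vit_flops(896, *small) + vit_flops(448, *base)

print(f"ViT-B @ 1120:          {single_branch / 1e9:.0f} GFLOPs")  # ~859
print(f"PIIP-TSB 1120/896/448: {piip_tsb / 1e9:.0f} GFLOPs")       # ~372
```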

🔍 Method

(Figure: overall PIIP architecture)
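
To make the paradigm concrete, here is a minimal, self-contained PyTorch sketch, not the repository's implementation: each branch is a plain ViT-style stack whose width grows as its input resolution shrinks, and a toy one-directional linear-projection fusion stands in for the paper's cross-branch interaction units. All dimensions, depths, and module names below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """One ViT-style branch: patch embedding plus a stack of transformer layers."""
    def __init__(self, dim, depth, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])

class PIIPSketch(nn.Module):
    """Widths grow (96 -> 384) while input resolutions shrink (448 -> 112).

    Depths are kept equal so the branches can run in lockstep, layer by layer.
    """
    def __init__(self, dims=(96, 192, 384), depths=(4, 4, 4),
                 resolutions=(448, 224, 112)):
        super().__init__()
        self.resolutions = resolutions
        self.branches = nn.ModuleList([Branch(d, n) for d, n in zip(dims, depths)])
        # Toy interaction: project branch i's width into branch i+1's width.
        self.proj = nn.ModuleList([nn.Linear(dims[i], dims[i + 1])
                                   for i in range(len(dims) - 1)])

    def forward(self, image):
        # Build the image pyramid: the larger the branch, the smaller its input.
        feats, grids = [], []
        for res, branch in zip(self.resolutions, self.branches):
            x = F.interpolate(image, size=(res, res), mode="bilinear",
                              align_corners=False)
            x = branch.patch_embed(x)                   # (B, C, H/16, W/16)
            grids.append(x.shape[-2:])
            feats.append(x.flatten(2).transpose(1, 2))  # (B, N, C)

        # Run all branches in lockstep, exchanging features after each layer.
        for step in range(len(self.branches[0].layers)):
            feats = [b.layers[step](f) for b, f in zip(self.branches, feats)]
            for i in range(len(feats) - 1):
                (h, w), dim_next = grids[i], feats[i + 1].shape[-1]
                msg = self.proj[i](feats[i]).transpose(1, 2).reshape(-1, dim_next, h, w)
                msg = F.interpolate(msg, size=grids[i + 1], mode="bilinear",
                                    align_corners=False)
                feats[i + 1] = feats[i + 1] + msg.flatten(2).transpose(1, 2)
        return feats  # multi-scale token features for a downstream head

outs = PIIPSketch()(torch.randn(1, 3, 448, 448))
print([tuple(o.shape) for o in outs])  # [(1, 784, 96), (1, 196, 192), (1, 49, 384)]
```

PIIP's actual interaction units are bidirectional and cross-attention based; the additive fusion above is kept deliberately simple to show only the data flow.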

🛠️ Usage

For instructions on installation, pretrained models, training, and evaluation, please refer to the README files under each task subfolder (detection, segmentation, and classification).
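
Since the detection code builds on MMDetection (see Acknowledgements), a released checkpoint can presumably be run with the standard MMDetection inference API, as in the hedged sketch below. The paths are placeholders, not real files in this repo; take the actual cfg/ckpt links from the tables under "Released Models" and follow the subfolder README.

```python
from mmdet.apis import inference_detector, init_detector

config = "path/to/piip_mask_rcnn_config.py"    # hypothetical path: use a released cfg
checkpoint = "path/to/piip_mask_rcnn.pth"      # hypothetical path: use a released ckpt

model = init_detector(config, checkpoint, device="cuda:0")
result = inference_detector(model, "demo.jpg")  # boxes (and masks) for one image
```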

🚀 Released Models

COCO Object Detection and Instance Segmentation

Note:

  1. We report the number of parameters and FLOPs of the backbone.
  2. Results in the paper were obtained with an internal codebase, which may exhibit slightly different performance from this repo ($\leq\pm0.2$).
  3. Unlike the experiments in the paper, those involving InternViT-6B do not use window attention.
| Backbone | Detector | Resolution | Schd | Box mAP | Mask mAP | #Param | #FLOPs | Download |
|---|---|---|---|---|---|---|---|---|
| ViT-B | Mask R-CNN | 1024 | 1x | 43.7 | 39.7 | 90M | 463G | log \| ckpt \| cfg |
| PIIP-TSB | Mask R-CNN | 1120/896/448 | 1x | 43.6 | 38.7 | 146M | 243G | log \| ckpt \| cfg |
| PIIP-TSB | Mask R-CNN | 1568/896/448 | 1x | 45.0 | 40.3 | 147M | 287G | log \| ckpt \| cfg |
| PIIP-TSB | Mask R-CNN | 1568/1120/672 | 1x | 46.5 | 41.3 | 149M | 453G | log \| ckpt \| cfg |
| ViT-L | Mask R-CNN | 1024 | 1x | 46.7 | 42.5 | 308M | 1542G | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | 1120/672/448 | 1x | 46.5 | 40.8 | 493M | 727G | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | 1344/896/448 | 1x | 48.3 | 42.7 | 495M | 1002G | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | 1568/896/672 | 1x | 49.3 | 43.7 | 497M | 1464G | log \| ckpt \| cfg |
| PIIP-TSBL | Mask R-CNN | 1344/896/672/448 | 1x | 47.1 | 41.9 | 506M | 755G | log \| ckpt \| cfg |
| PIIP-TSBL | Mask R-CNN | 1568/1120/672/448 | 1x | 48.2 | 42.9 | 507M | 861G | log \| ckpt \| cfg |
| PIIP-TSBL | Mask R-CNN | 1792/1568/1120/448 | 1x | 49.4 | 44.1 | 512M | 1535G | log \| ckpt \| cfg |
| InternViT-6B | Mask R-CNN | 1024 | 1x | 53.8 | 48.1 | 5919M | 29323G | log \| ckpt \| cfg |
| PIIP-H6B | Mask R-CNN | 1024/512 | 1x | 55.8 | 49.0 | 6872M | 11080G | log \| ckpt \| cfg |

| Backbone | Detector | Pretrain | Resolution | Schd | Box mAP | Mask mAP | Download |
|---|---|---|---|---|---|---|---|
| PIIP-SBL | Mask R-CNN | AugReg (384) | 1568/1120/672 | 1x | 48.3 | 42.6 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + Uni-Perceiver (BL) | 1568/1120/672 | 1x | 48.8 | 42.9 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + MAE (BL) | 1568/1120/672 | 1x | 49.1 | 43.0 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III | 1568/1120/672 | 1x | 50.0 | 44.4 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + DINOv2 (BL) | 1568/1120/672 | 1x | 51.0 | 44.7 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + BEiTv2 (BL) | 1568/1120/672 | 1x | 51.8 | 45.4 | log \| ckpt \| cfg |
| PIIP-SBL | DINO | DeiT III (384) | 1792/1120/672 | 3x | 57.8 | - | log \| ckpt \| cfg |
| PIIP-H6B | DINO | MAE (H) + InternVL (6B) | 1024/768 | 1x | 60.0 | - | log \| ckpt \| cfg |

ADE20K Semantic Segmentation

| Backbone | Method | Resolution | Schd | mIoU | #Param | #FLOPs | Download |
|---|---|---|---|---|---|---|---|
| InternViT-6B | UperNet | 512 | 80k | 58.42 | 5910M | 6364G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/192 | 80k | 57.81 | 6745M | 1663G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/256 | 80k | 58.35 | 6745M | 2354G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/384 | 80k | 59.32 | 6746M | 4374G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/512 | 80k | 59.85 | 6747M | 7308G | log \| ckpt \| cfg |

ImageNet-1K Image Classification

| Model | Resolution | #Param | #FLOPs | Top-1 Acc | Config | Download |
|---|---|---|---|---|---|---|
| PIIP-TSB | 368/192/128 | 144M | 17.4G | 82.1 | config | log \| ckpt |
| PIIP-SBL | 320/160/96 | 489M | 39.0G | 85.2 | config | log \| ckpt |
| PIIP-SBL | 384/192/128 | 489M | 61.2G | 85.9 | config | log \| ckpt |

📅 Schedule

  • detection code
  • classification code
  • segmentation code

🖊️ Citation

If you find this work helpful for your research, please consider giving this repo a star ⭐ and citing our paper:

@article{piip,
  title={Parameter-Inverted Image Pyramid Networks},
  author={Zhu, Xizhou and Yang, Xue and Wang, Zhaokai and Li, Hao and Dou, Wenhan and Ge, Junqi and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2406.04330},
  year={2024}
}

📃 License

This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.

🙏 Acknowledgements

Our code is built with reference to the following projects: InternVL-MMDetSeg, ViT-Adapter, DeiT, MMDetection, MMSegmentation, and timm. Thanks for their awesome work!