InternVideo [Paper]

Chinese README (中文)


This repo provides the official implementation of 'InternVideo: General Video Foundation Models via Generative and Discriminative Learning'.

  • Achieved 91.1% top-1 accuracy on Kinetics-400, surpassing the 90% milestone for the first time.
  • Achieved 77.2% top-1 accuracy on Something-Something V2.
  • Achieved SOTA performance on 39 video datasets (including action recognition, temporal localization, retrieval, etc.) when released in 2022.

Updates

  • Jan 16, 2024: InternVid (a video-text dataset for video understanding and generation) has been accepted for a spotlight presentation at ICLR 2024.
  • Sep 7, 2023: ViCLIP, a simple video CLIP for transferable video-text representation, is available at Hugging Face and 🤗. It delivers strong zero-shot action recognition performance. Give it a try.
  • July 16, 2023: A video-text dataset, InternVid, is partially released here to facilitate multimodal understanding and generation. A subset of this dataset, consisting of 10 million video clips, is available at Hugging Face.
  • May 11, 2023: Video instruction data are released here for tuning end-to-end video-centric multimodal dialogue systems like VideoChat.
  • Mar 8, 2023: All pretrained foundation model weights are released. See them here.
  • Feb 19, 2023: Some pretrained foundation model weights (-L) are released.
  • Feb 5, 2023: The code & model of multimodal learning are released.
  • Jan 18, 2023: The code of vision-language navigation is released.
  • Jan 16, 2023: The code of video question answering, zero-shot action recognition, and zero-shot multiple choice is released.
  • Jan 1, 2023: The code & model of spatio-temporal action localization are released.
  • Dec 27, 2022: The code & model of partial pretraining (VideoMAE) and downstream applications (video-text retrieval, temporal action localization, open-set action recognition, and Ego4D-related tasks) are released.
  • Dec 6, 2022: The technical report of InternVideo is released.
  • Sep 2, 2022: Press releases (official | 163 news | qq news).

Introduction

We present the first video foundation model to achieve high performance on both video and video-text tasks.

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively.
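As a rough illustration of the two pretraining objectives named above, the following pure-Python sketch shows (1) a tube mask for masked video modeling, where the same patches are hidden in every frame, and (2) a symmetric InfoNCE loss for video-language contrastive learning. This is not the code in this repo: the function names, the 0.9 mask ratio, and the 0.07 temperature are assumptions chosen for illustration, and the real models operate on learned embeddings rather than toy vectors.

```python
import math
import random


def tube_mask(num_frames, num_patches, mask_ratio, seed=0):
    """Build a mask that hides the same patches in every frame (True = masked).

    Sharing one spatial mask across time ("tube" masking) is an assumption
    here, not a statement about this repo's exact masking strategy.
    """
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked for p in range(num_patches)]
    return [list(frame_mask) for _ in range(num_frames)]


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings."""
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    # logits[i][j] = similarity between video i and text j
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Toy usage: 4 frames of 10 patches, and a batch of two video/text pairs.
mask = tube_mask(num_frames=4, num_patches=10, mask_ratio=0.9)
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss for correctly paired embeddings (`aligned`) is lower than for mismatched pairs (`shuffled`), which is the behavior the contrastive objective rewards.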

Code & model

Performance

Model Zoo

Pretrained Models

| Model | Training Data | Download |
| --- | --- | --- |
| InternVideo-MM-L-14 | WebVid10M+Self-collected (14M) | ckpt |
| VideoMAE-B | UnlabeledHybrid (1M) | ckpt |
| VideoMAE-L | UnlabeledHybrid (1M) | ckpt |
| VideoMAE-H | UnlabeledHybrid (1M) | ckpt |

Downstream Tasks

Classification

| Model | Finetuning Data | Download |
| --- | --- | --- |
| VideoMAE-B | K400 | ckpt |
| VideoMAE-B | K710 | ckpt |
| VideoMAE-B | SSv2 | ckpt |
| VideoMAE-L | K400 | ckpt |
| VideoMAE-L | K700 | ckpt |
| VideoMAE-L | SSv2 | ckpt |
| VideoMAE-H | K400 | ckpt, log |
| VideoMAE-H | SSv1 | ckpt, log |
| VideoMAE-H | HMDB51 | ckpt_split1 |

Retrieval

| Model | Training Data | Download |
| --- | --- | --- |
| InternVideo-MM-L-14 | ActivityNet | ckpt, opt, log |
| InternVideo-MM-L-14 | DiDeMo | ckpt, opt, log |
| InternVideo-MM-L-14 | LSMDC | ckpt, opt, log |
| InternVideo-MM-L-14 | MSR-VTT | ckpt, opt, log |
| InternVideo-MM-L-14 | MSVD | ckpt, opt, log |
| InternVideo-MM-L-14 | VATEX | ckpt, opt, log |

VideoQA

| Model | Finetuning Data | Download |
| --- | --- | --- |
| InternVideo-MM-L-14 | MSR-VTT | ckpt |
| InternVideo-MM-L-14 | MSVD | ckpt |
| InternVideo-MM-L-14 | TGIFQA | ckpt |

Spatio-Temporal Action Localization

| Model | Finetuning Data | Download |
| --- | --- | --- |
| VideoMAE-H | AVA-Kinetics | ckpt |

To help us improve our work, please fill out the form (or scan the QR code below) if you have time.

[Image: survey QR code]

Citation

If this work is helpful for your research, please consider citing InternVideo.

@article{wang2022internvideo,
  title={InternVideo: General Video Foundation Models via Generative and Discriminative Learning},
  author={Wang, Yi and Li, Kunchang and Li, Yizhuo and He, Yinan and Huang, Bingkun and Zhao, Zhiyu and Zhang, Hongjie and Xu, Jilan and Liu, Yi and Wang, Zun and Xing, Sen and Chen, Guo and Pan, Junting and Yu, Jiashuo and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2212.03191},
  year={2022}
}

@article{wang2023videomae,
  title={VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
  author={Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
  journal={arXiv preprint arXiv:2303.16727},
  year={2023}
}

@article{li2022uniformerv2,
  title={UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer},
  author={Li, Kunchang and Wang, Yali and He, Yinan and Li, Yizhuo and Wang, Yi and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2211.09552},
  year={2022}
}

@article{li2023unmasked,
  title={Unmasked Teacher: Towards Training-Efficient Video Foundation Models},
  author={Li, Kunchang and Wang, Yali and Li, Yizhuo and Wang, Yi and He, Yinan and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2303.16058},
  year={2023}
}

@article{wang2023internvid,
  title={InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation},
  author={Wang, Yi and He, Yinan and Li, Yizhuo and Li, Kunchang and Yu, Jiashuo and Ma, Xin and Chen, Xinyuan and Wang, Yaohui and Luo, Ping and Liu, Ziwei and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2307.06942},
  year={2023}
}