diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index d4f3101376..9583935859 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -5,7 +5,7 @@ repos: hooks: - id: flake8 - repo: https://github.com/PyCQA/isort - rev: 5.10.1 + rev: 5.11.5 hooks: - id: isort - repo: https://github.com/pre-commit/mirrors-yapf diff --git a/README.md b/README.md index 4475b1a508..5e85da2fd5 100644 --- a/README.md +++ b/README.md @@ -70,15 +70,11 @@ The 1.x branch works with **PyTorch 1.6+**. ## What's New -**Release**: v1.0.0rc2 with the following new features: +**Release (2023.02.10)**: v1.0.0rc3 with the following new features: -- We Support Omni-Sourece training on ImageNet and Kinetics datasets. -- We support exporting spatial-temporal detection models to ONNX. -- We support **STGCN++** on NTU-RGB+D. -- We support **MViT V2** on Kinetics 400 and something-V2. -- We refine our skeleton-based pipelines and support the joint training of multi-stream skeleton information, including **joint, bone, joint-motion, and bone-motion**. -- We support **VideoMAE** on Kinetics400. -- We support **C2D** on Kinetics400, achieve 73.57% Top-1 accuracy (higher than 71.8% in the [paper](https://arxiv.org/abs/1711.07971)). +- Support Action Recognition models UniFormer V1 (ICLR'2022) and UniFormer V2 (Arxiv'2022). +- Support training MViT V2 (CVPR'2022) and MaskFeat (CVPR'2022) fine-tuning. +- Add a new handy interface for inference with MMAction2 models ([demo](https://github.com/open-mmlab/mmaction2/blob/dev-1.x/demo/README.md#inferencer)) ## Installation @@ -119,9 +115,9 @@ Please refer to [install.md](https://mmaction2.readthedocs.io/en/1.x/get_started VideoMAE (NeurIPS'2022) - MViT V2 (CVPR'2022) - - + MViT V2 (CVPR'2022) + UniFormer V1 (ICLR'2022) + UniFormer V2 (Arxiv'2022) @@ -209,7 +205,7 @@ If you have any feature requests, please feel free to leave a comment in [Issues UCF101-24* (Homepage) (CRCV-IR-12-01) JHMDB* (Homepage) (ICCV'2015) AVA (Homepage) (CVPR'2018) - + AVA-Kinetics (Homepage) (Arxiv'2020) Skeleton-based Action Recognition diff --git a/configs/recognition/i3d/metafile.yml b/configs/recognition/i3d/metafile.yml index 63ad017343..f12ba591dc 100644 --- a/configs/recognition/i3d/metafile.yml +++ b/configs/recognition/i3d/metafile.yml @@ -7,6 +7,8 @@ Collections: Models: - Name: i3d_imagenet-pretrained-r50-nl-dot-product_8xb8-32x2x1-100e_kinetics400-rgb + Alias: + - i3d Config: configs/recognition/i3d/i3d_imagenet-pretrained-r50-nl-dot-product_8xb8-32x2x1-100e_kinetics400-rgb.py In Collection: I3D Metadata: diff --git a/configs/recognition/mvit/README.md b/configs/recognition/mvit/README.md index d040cb1de4..15f8723615 100644 --- a/configs/recognition/mvit/README.md +++ b/configs/recognition/mvit/README.md @@ -23,28 +23,54 @@ well as 86.1% on Kinetics-400 video classification. -## Results and models +## Results and Models -### Kinetics-400 +1. Models with * in `Inference results` are ported from the [SlowFast](https://github.com/facebookresearch/SlowFast/) repo and tested on our data, while models in `Training results` are trained in MMAction2 on our data. +2. The values in columns named after `reference` are copied from the paper, and those named after `reference*` are results trained with the [SlowFast](https://github.com/facebookresearch/SlowFast/) repo on our data. +3. The validation set of Kinetics400 we used consists of 19796 videos.
These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available. +4. The MaskFeat fine-tuning experiment is based on the pretrained model from [MMSelfSup](https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/projects/maskfeat_video), and the corresponding reference result is based on the pretrained model from [SlowFast](https://github.com/facebookresearch/SlowFast/). +5. Due to the different versions of Kinetics-400, our training results differ from those in the paper. +6. For training efficiency, we currently only provide training results for MViT-small; we do not guarantee the training accuracy of the other configs and welcome you to contribute your reproduction results. +7. We use `repeat augment` in the MViT training configs following [SlowFast](https://github.com/facebookresearch/SlowFast/). [Repeat augment](https://arxiv.org/pdf/1901.09335.pdf) applies data augmentation multiple times to each video, which can improve the generalization of the model and relieve the I/O stress of loading videos. Please note that the actual batch size is `num_repeats` times the `batch_size` in `train_dataloader`. -| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top1 acc | testing protocol | FLOPs | params | config | ckpt | +### Inference results + +#### Kinetics-400 + +| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt | | :---------------------: | :--------: | :--------: | :----------: | :------: | :------: | :------------------------------: | :------------------------------: | :--------------: | :---: | :----: | :------------------: | :----------------: | -| 16x4x1 | 224x224 | MViTv2-S\* | From scratch | 81.1 | 94.7 | [81.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 64G | 34.5M | [config](/configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth) | +| 16x4x1 | 224x224 | MViTv2-S\* | From scratch | 81.1 | 94.7 | [81.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 64G | 34.5M | [config](/configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth) | | 32x3x1 | 224x224 | MViTv2-B\* | From scratch | 82.6 | 95.8 | [82.9](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [95.7](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 225G | 51.2M | [config](/configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py) |
[ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_32x3x1_kinetics400-rgb_20221021-f392cd2d.pth) | | 40x3x1 | 312x312 | MViTv2-L\* | From scratch | 85.4 | 96.2 | [86.1](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [97.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 3 crop | 2828G | 213M | [config](/configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth) | -### Something-Something V2 +#### Something-Something V2 -| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top1 acc | testing protocol | FLOPs | params | config | ckpt | +| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt | | :---------------------: | :--------: | :--------: | :----------: | :------: | :------: | :------------------------------: | :------------------------------: | :--------------: | :---: | :----: | :------------------: | :----------------: | -| uniform 16 | 224x224 | MViTv2-S\* | K400 | 68.1 | 91.0 | [68.2](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [91.4](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clips x 3 crop | 64G | 34.4M | [config](/configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth) | +| uniform 16 | 224x224 | MViTv2-S\* | K400 | 68.1 | 91.0 | [68.2](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [91.4](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clips x 3 crop | 64G | 34.4M | [config](/configs/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth) | | uniform 32 | 224x224 | MViTv2-B\* | K400 | 70.8 | 92.7 | [70.5](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [92.7](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clips x 3 crop | 225G | 51.1M | [config](/configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_u32_sthv2-rgb_20221021-d5de5da6.pth) | | uniform 40 | 312x312 | MViTv2-L\* | IN21K + K400 | 73.2 | 94.0 | [73.3](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clips x 3 crop | 2828G | 213M | [config](/configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth) | -*Models with * are ported from the repo [SlowFast](https://github.com/facebookresearch/SlowFast/) and tested on our data. 
Currently, we only support the testing of MViT models, training will be available soon.* +### Training results + +#### Kinetics-400 + +| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference\* top1 acc | reference\* top5 acc | testing protocol | FLOPs | params | config | ckpt | log | +| :---------------------: | :--------: | :------: | :-----------: | :------: | :------: | :---------------------------: | :----------------------------: | :---------------: | :---: | :----: | :--------------: | :------------: | :-----------: | +| 16x4x1 | 224x224 | MViTv2-S | From scratch | 80.6 | 94.7 | [80.8](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 64G | 34.5M | [config](configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb_20230201-23284ff3.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.log) | +| 16x4x1 | 224x224 | MViTv2-S | K400 MaskFeat | 81.8 | 95.2 | [81.5](https://github.com/facebookresearch/SlowFast/blob/main/projects/maskfeat/README.md) | [94.9](https://github.com/facebookresearch/SlowFast/blob/main/projects/maskfeat/README.md) | 10 clips x 1 crop | 71G | 36.4M | [config](/configs/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb_20230201-5bced1d0.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.log) | + +the corresponding result without repeat augment is as follows: + +| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference\* top1 acc | reference\* top5 acc | testing protocol | FLOPs | params | +| :---------------------: | :--------: | :------: | :----------: | :------: | :------: | :--------------------------------------------------: | :--------------------------------------------------: | :--------------: | :---: | :----: | +| 16x4x1 | 224x224 | MViTv2-S | From scratch | 79.4 | 93.9 | [80.8](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 64G | 34.5M | + +#### Something-Something V2 -1. The values in columns named after "reference" are copied from paper -2. The validation set of Kinetics400 we used consists of 19796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available. 
+| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt | log | +| :---------------------: | :--------: | :------: | :------: | :------: | :------: | :---------------------------: | :----------------------------: | :--------------: | :---: | :----: | :----------------: | :--------------: | :-------------: | +| uniform 16 | 224x224 | MViTv2-S | K400 | 68.2 | 91.3 | [68.2](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [91.4](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clips x 3 crop | 64G | 34.4M | [config](/configs/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb_20230201-4065c1b9.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.log) | For more details on data preparation, you can refer to diff --git a/configs/recognition/mvit/metafile.yml b/configs/recognition/mvit/metafile.yml index 888fa24732..3170c61bdc 100644 --- a/configs/recognition/mvit/metafile.yml +++ b/configs/recognition/mvit/metafile.yml @@ -6,8 +6,8 @@ Collections: Title: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection" Models: - - Name: mvit-small-p244_16x4x1_kinetics400-rgb - Config: configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py + - Name: mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb_infer + Config: configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py In Collection: MViT Metadata: Architecture: MViT-small @@ -24,6 +24,28 @@ Models: Top 5 Accuracy: 94.7 Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth + - Name: mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb + Config: configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py + In Collection: MViT + Metadata: + Architecture: MViT-small + Batch Size: 16 + Epochs: 100 + FLOPs: 64G + Parameters: 34.5M + Resolution: 224x224 + Training Data: Kinetics-400 + Training Resources: 32 GPUs + Modality: RGB + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 80.6 + Top 5 Accuracy: 94.7 + Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.log + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb_20230201-23284ff3.pth + - Name: mvit-base-p244_32x3x1_kinetics400-rgb Config: configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py In Collection: MViT @@ -60,8 +82,8 @@ Models: Top 5 Accuracy: 94.7 Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth - - Name: mvit-small-p244_u16_sthv2-rgb - Config: configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py + - Name: mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb_infer + Config: 
configs/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.py In Collection: MViT Metadata: Architecture: MViT-small @@ -78,6 +100,29 @@ Models: Top 5 Accuracy: 91.0 Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth + - Name: mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb + Config: configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py + In Collection: MViT + Metadata: + Architecture: MViT-small + Batch Size: 16 + Epochs: 100 + FLOPs: 64G + Parameters: 34.4M + Pretrained: Kinetics-400 + Resolution: 224x224 + Training Data: SthV2 + Training Resources: 16 GPUs + Modality: RGB + Results: + - Dataset: SthV2 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 68.2 + Top 5 Accuracy: 91.3 + Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.log + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb_20230201-4065c1b9.pth + - Name: mvit-base-p244_u32_sthv2-rgb Config: configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py In Collection: MViT @@ -113,3 +158,26 @@ Models: Top 1 Accuracy: 73.2 Top 5 Accuracy: 94.0 Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth + + - Name: mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb + Config: configs/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.py + In Collection: MViT + Metadata: + Architecture: MViT-small + Batch Size: 32 + Epochs: 100 + FLOPs: 71G + Parameters: 36.4M + Pretrained: Kinetics-400 MaskFeat + Resolution: 224x224 + Training Data: Kinetics-400 + Training Resources: 8 GPUs + Modality: RGB + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 81.8 + Top 5 Accuracy: 95.2 + Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.log + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb_20230201-5bced1d0.pth diff --git a/configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py b/configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py index b1e186f195..fb552c9329 100644 --- a/configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py +++ b/configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py @@ -76,13 +76,17 @@ dict(type='PackActionInputs') ] +repeat_sample = 2 train_dataloader = dict( batch_size=8, num_workers=8, persistent_workers=True, sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='repeat_pseudo_collate'), dataset=dict( - type=dataset_type, + type='RepeatAugDataset', + num_repeats=repeat_sample, + sample_once=True, ann_file=ann_file_train, data_prefix=dict(video=data_root), pipeline=train_pipeline)) @@ -113,19 +117,21 @@ test_evaluator = val_evaluator train_cfg = dict( - type='EpochBasedTrainLoop', max_epochs=30, val_begin=1, val_interval=3) + type='EpochBasedTrainLoop', max_epochs=200, val_begin=1, val_interval=1) val_cfg = 
dict(type='ValLoop') test_cfg = dict(type='TestLoop') +base_lr = 1.6e-3 optim_wrapper = dict( - type='AmpOptimWrapper', optimizer=dict( - type='AdamW', lr=1.6e-3, betas=(0.9, 0.999), weight_decay=0.05)) + type='AdamW', lr=base_lr, betas=(0.9, 0.999), weight_decay=0.05), + paramwise_cfg=dict(norm_decay_mult=0.0, bias_decay_mult=0.0), + clip_grad=dict(max_norm=1, norm_type=2)) param_scheduler = [ dict( type='LinearLR', - start_factor=0.1, + start_factor=0.01, by_epoch=True, begin=0, end=30, @@ -133,9 +139,9 @@ dict( type='CosineAnnealingLR', T_max=200, - eta_min=0, + eta_min=base_lr / 100, by_epoch=True, - begin=0, + begin=30, end=200, convert_to_iter_based=True) ] @@ -147,4 +153,4 @@ # - `enable` means enable scaling LR automatically # or not by default. # - `base_batch_size` = (8 GPUs) x (8 samples per GPU). -auto_scale_lr = dict(enable=False, base_batch_size=64) +auto_scale_lr = dict(enable=False, base_batch_size=512 // repeat_sample) diff --git a/configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py b/configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py index c954b60b54..cdbf22dd1f 100644 --- a/configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py +++ b/configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py @@ -108,7 +108,6 @@ base_lr = 1.6e-3 optim_wrapper = dict( - type='AmpOptimWrapper', optimizer=dict( type='AdamW', lr=base_lr, betas=(0.9, 0.999), weight_decay=0.05)) diff --git a/configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py b/configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py index 8c93519914..f2d7ef1419 100644 --- a/configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py +++ b/configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py @@ -13,12 +13,6 @@ type='ActionDataPreprocessor', mean=[114.75, 114.75, 114.75], std=[57.375, 57.375, 57.375], - blending=dict( - type='RandomBatchAugment', - augments=[ - dict(type='MixupBlending', alpha=0.8, num_classes=400), - dict(type='CutmixBlending', alpha=1, num_classes=400) - ]), format_shape='NCTHW'), cls_head=dict(in_channels=1152), test_cfg=dict(max_testing_views=5)) @@ -78,13 +72,17 @@ dict(type='PackActionInputs') ] +repeat_sample = 2 train_dataloader = dict( batch_size=8, num_workers=8, persistent_workers=True, sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='repeat_pseudo_collate'), dataset=dict( - type=dataset_type, + type='RepeatAugDataset', + num_repeats=repeat_sample, + sample_once=True, ann_file=ann_file_train, data_prefix=dict(video=data_root), pipeline=train_pipeline)) @@ -119,26 +117,21 @@ val_cfg = dict(type='ValLoop') test_cfg = dict(type='TestLoop') +base_lr = 1.6e-3 optim_wrapper = dict( - type='AmpOptimWrapper', optimizer=dict( - type='AdamW', lr=1.6e-3, betas=(0.9, 0.999), weight_decay=0.05)) + type='AdamW', lr=base_lr, betas=(0.9, 0.999), weight_decay=10e-8), + paramwise_cfg=dict(norm_decay_mult=0.0, bias_decay_mult=0.0), + clip_grad=dict(max_norm=1, norm_type=2)) param_scheduler = [ - dict( - type='LinearLR', - start_factor=0.1, - by_epoch=True, - begin=0, - end=30, - convert_to_iter_based=True), dict( type='CosineAnnealingLR', - T_max=200, + T_max=30, eta_min=0, by_epoch=True, begin=0, - end=200, + end=30, convert_to_iter_based=True) ] @@ -149,4 +142,4 @@ # - `enable` means enable scaling LR automatically # or not by default. # - `base_batch_size` = (8 GPUs) x (8 samples per GPU). 
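+# With repeat augment enabled above, each dataloader batch effectively
+# contains `batch_size * repeat_sample` samples per GPU, which is presumably
+# why the reference `base_batch_size` below is divided by `repeat_sample`.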
-auto_scale_lr = dict(enable=True, base_batch_size=512) +auto_scale_lr = dict(enable=True, base_batch_size=128 // repeat_sample) diff --git a/configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py b/configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py index b3fde41a78..ea9d54c068 100644 --- a/configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py +++ b/configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py @@ -110,7 +110,6 @@ base_lr = 1.6e-3 optim_wrapper = dict( - type='AmpOptimWrapper', optimizer=dict( type='AdamW', lr=base_lr, betas=(0.9, 0.999), weight_decay=0.05)) diff --git a/configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py b/configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py similarity index 91% rename from configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py rename to configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py index 4da89b5a4a..9f6b1cbd6d 100644 --- a/configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py +++ b/configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py @@ -24,6 +24,7 @@ ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt' file_client_args = dict(io_backend='disk') + train_pipeline = [ dict(type='DecordInit', **file_client_args), dict(type='SampleFrames', clip_len=16, frame_interval=4, num_clips=1), @@ -70,13 +71,17 @@ dict(type='PackActionInputs') ] +repeat_sample = 2 train_dataloader = dict( batch_size=8, num_workers=8, persistent_workers=True, sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='repeat_pseudo_collate'), dataset=dict( - type=dataset_type, + type='RepeatAugDataset', + num_repeats=repeat_sample, + sample_once=True, ann_file=ann_file_train, data_prefix=dict(video=data_root), pipeline=train_pipeline)) @@ -107,20 +112,21 @@ test_evaluator = val_evaluator train_cfg = dict( - type='EpochBasedTrainLoop', max_epochs=200, val_begin=1, val_interval=3) + type='EpochBasedTrainLoop', max_epochs=200, val_begin=1, val_interval=1) val_cfg = dict(type='ValLoop') test_cfg = dict(type='TestLoop') base_lr = 1.6e-3 optim_wrapper = dict( - type='AmpOptimWrapper', optimizer=dict( - type='AdamW', lr=base_lr, betas=(0.9, 0.999), weight_decay=0.05)) + type='AdamW', lr=base_lr, betas=(0.9, 0.999), weight_decay=0.05), + paramwise_cfg=dict(norm_decay_mult=0.0, bias_decay_mult=0.0), + clip_grad=dict(max_norm=1, norm_type=2)) param_scheduler = [ dict( type='LinearLR', - start_factor=0.1, + start_factor=0.01, by_epoch=True, begin=0, end=30, @@ -142,4 +148,4 @@ # - `enable` means enable scaling LR automatically # or not by default. # - `base_batch_size` = (8 GPUs) x (8 samples per GPU). 
-auto_scale_lr = dict(enable=True, base_batch_size=512) +auto_scale_lr = dict(enable=True, base_batch_size=512 // repeat_sample) diff --git a/configs/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.py b/configs/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.py new file mode 100644 index 0000000000..6fa2a5e654 --- /dev/null +++ b/configs/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.py @@ -0,0 +1,158 @@ +_base_ = [ + '../../_base_/models/mvit_small.py', '../../_base_/default_runtime.py' +] + +model = dict( + backbone=dict( + drop_path_rate=0.1, + dim_mul_in_attention=False, + pretrained= # noqa: E251 + 'https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_mvit-small_16xb32-amp-coslr-300e_k400/maskfeat_mvit-small_16xb32-amp-coslr-300e_k400_20230131-87d60b6f.pth', # noqa + pretrained_type='maskfeat', + ), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + blending=dict( + type='RandomBatchAugment', + augments=[ + dict(type='MixupBlending', alpha=0.8, num_classes=400), + dict(type='CutmixBlending', alpha=1, num_classes=400) + ]), + format_shape='NCTHW'), + cls_head=dict(dropout_ratio=0., init_scale=0.001)) + +# dataset settings +dataset_type = 'VideoDataset' +data_root = 'data/kinetics400/videos_train' +data_root_val = 'data/kinetics400/videos_val' +ann_file_train = 'data/kinetics400/kinetics400_train_list_videos.txt' +ann_file_val = 'data/kinetics400/kinetics400_val_list_videos.txt' +ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt' + +file_client_args = dict(io_backend='disk') +train_pipeline = [ + dict(type='DecordInit', **file_client_args), + dict(type='SampleFrames', clip_len=16, frame_interval=4, num_clips=1), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 256)), + dict(type='PytorchVideoWrapper', op='RandAugment', magnitude=7), + dict(type='RandomResizedCrop'), + dict(type='Resize', scale=(224, 224), keep_ratio=False), + dict(type='Flip', flip_ratio=0.5), + dict(type='RandomErasing', erase_prob=0.25, mode='rand'), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] +val_pipeline = [ + dict(type='DecordInit', **file_client_args), + dict( + type='SampleFrames', + clip_len=16, + frame_interval=4, + num_clips=1, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 256)), + dict(type='CenterCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] +test_pipeline = [ + dict(type='DecordInit', **file_client_args), + dict( + type='SampleFrames', + clip_len=16, + frame_interval=4, + num_clips=10, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='CenterCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +repeat_sample = 2 +train_dataloader = dict( + batch_size=16, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='repeat_pseudo_collate'), + dataset=dict( + type='RepeatAugDataset', + num_repeats=repeat_sample, + ann_file=ann_file_train, + data_prefix=dict(video=data_root), + pipeline=train_pipeline)) +val_dataloader = dict( + batch_size=8, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + 
ann_file=ann_file_val, + data_prefix=dict(video=data_root_val), + pipeline=val_pipeline, + test_mode=True)) +test_dataloader = dict( + batch_size=1, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True)) + +val_evaluator = dict(type='AccMetric') +test_evaluator = val_evaluator + +train_cfg = dict( + type='EpochBasedTrainLoop', max_epochs=100, val_begin=1, val_interval=1) +val_cfg = dict(type='ValLoop') +test_cfg = dict(type='TestLoop') + +base_lr = 9.6e-3 # for batch size 512 +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=base_lr, betas=(0.9, 0.999), weight_decay=0.05), + constructor='LearningRateDecayOptimizerConstructor', + paramwise_cfg={ + 'decay_rate': 0.75, + 'decay_type': 'layer_wise', + 'num_layers': 16 + }, + clip_grad=dict(max_norm=5, norm_type=2)) + +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1 / 600, + by_epoch=True, + begin=0, + end=20, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=80, + eta_min_ratio=1 / 600, + by_epoch=True, + begin=20, + end=100, + convert_to_iter_based=True) +] + +default_hooks = dict( + checkpoint=dict(interval=3, max_keep_ckpts=20), logger=dict(interval=100)) + +# Default setting for scaling LR automatically +# - `enable` means enable scaling LR automatically +# or not by default. +# - `base_batch_size` = (8 GPUs) x (8 samples per GPU). +auto_scale_lr = dict(enable=True, base_batch_size=512 // repeat_sample) diff --git a/configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py b/configs/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.py similarity index 91% rename from configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py rename to configs/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.py index 08934b9a5e..1b4135b52e 100644 --- a/configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py +++ b/configs/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.py @@ -2,7 +2,14 @@ '../../_base_/models/mvit_small.py', '../../_base_/default_runtime.py' ] -model = dict(cls_head=dict(num_classes=174)) +model = dict( + backbone=dict( + init_cfg=dict( + type='Pretrained', + checkpoint= # noqa: E251 + 'https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth', # noqa: E501 + prefix='backbone.')), + cls_head=dict(num_classes=174)) # dataset settings dataset_type = 'VideoDataset' @@ -91,7 +98,6 @@ base_lr = 1.6e-3 optim_wrapper = dict( - type='AmpOptimWrapper', optimizer=dict( type='AdamW', lr=base_lr, betas=(0.9, 0.999), weight_decay=0.05), paramwise_cfg=dict(norm_decay_mult=0.0, bias_decay_mult=0.0)) diff --git a/configs/recognition/omnisource/README.md b/configs/recognition/omnisource/README.md new file mode 100644 index 0000000000..64acf52c35 --- /dev/null +++ b/configs/recognition/omnisource/README.md @@ -0,0 +1,79 @@ +# Omnisource + + + + + +## Abstract + + + +We propose to train a recognizer that can classify images and videos. The recognizer is jointly trained on image and video datasets. Compared with pre-training on the same image dataset, this method can significantly improve the video recognition performance. 
+ + + +## Results and Models + +### Kinetics-400 + +| frame sampling strategy | scheduler | resolution | gpus | backbone | joint-training | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log | +| :---------------------: | :-----------: | :--------: | :--: | :------: | :------------: | :------: | :------: | :---------------: | :----: | :----: | :---------------------------: | :-------------------------: | :-------------------------: | +| 8x8x1 | Linear+Cosine | 224x224 | 8 | ResNet50 | ImageNet | 77.30 | 93.23 | 10 clips x 3 crop | 54.75G | 32.45M | [config](/configs/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb_20230208-61c4be0d.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb.log) | + +1. The **gpus** column indicates the number of GPUs we used to get the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set `--auto-scale-lr` when calling `tools/train.py`; this parameter will automatically scale the learning rate according to the ratio between the actual batch size and the original batch size. +2. The validation set of Kinetics400 we used consists of 19796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available. + +For more details on data preparation, you can refer to [Kinetics400](/tools/data/kinetics/README.md). + +## Train + +You can use the following command to train a model. + +```shell +python tools/train.py ${CONFIG_FILE} [optional arguments] +``` + +Example: train the SlowOnly model on the Kinetics-400 dataset with deterministic behavior and periodic validation. + +```shell +python tools/train.py configs/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb.py \ + --seed=0 --deterministic +``` + +We found that the training of this Omnisource model could crash for unknown reasons. If this happens, you can resume training by adding `--cfg-options resume=True` to the training command. + +For more details, you can refer to the **Training** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md). + +## Test + +You can use the following command to test a model. + +```shell +python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments] +``` + +Example: test the SlowOnly model on the Kinetics-400 dataset and dump the result to a pkl file. + +```shell +python tools/test.py configs/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb.py \ + checkpoints/SOME_CHECKPOINT.pth --dump result.pkl +``` + +For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).
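+
+The dumped `result.pkl` stores the per-video predictions produced during testing. The snippet below is a minimal sketch for inspecting it offline; it only loads and prints the file, and the exact fields inside each entry depend on your MMAction2 version, so verify them by printing an entry before relying on specific keys.
+
+```python
+# Minimal sketch: inspect the predictions dumped via `--dump result.pkl`.
+import pickle
+
+with open('result.pkl', 'rb') as f:
+    results = pickle.load(f)  # typically a list with one entry per test video
+
+print(len(results))  # number of test samples
+print(results[0])    # print one entry to see which fields are available
+```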
+ +## Citation + +```BibTeX +@inproceedings{feichtenhofer2019slowfast, + title={Slowfast networks for video recognition}, + author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming}, + booktitle={Proceedings of the IEEE international conference on computer vision}, + pages={6202--6211}, + year={2019} +} +``` diff --git a/configs/recognition/omnisource/metafile.yml b/configs/recognition/omnisource/metafile.yml new file mode 100644 index 0000000000..af4524e5b0 --- /dev/null +++ b/configs/recognition/omnisource/metafile.yml @@ -0,0 +1,28 @@ +Collections: + - Name: Omnisource + README: configs/recognition/omnisource/README.md + + +Models: + - Name: slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb + Config: configs/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb.py + In Collection: SlowOnly + Metadata: + Architecture: ResNet50 + Batch Size: 16 + Epochs: 256 + FLOPs: 54.75G + Parameters: 32.45M + Pretrained: None + Resolution: short-side 320 + Training Data: Kinetics-400 + Training Resources: 8 GPUs + Modality: RGB + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 77.30 + Top 5 Accuracy: 93.23 + Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb.log + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb_20230208-61c4be0d.pth diff --git a/configs/recognition/omnisource/slowonly_r50_16xb16-8x8x1-256e_imagenet-kinetics400-rgb.py b/configs/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb.py similarity index 98% rename from configs/recognition/omnisource/slowonly_r50_16xb16-8x8x1-256e_imagenet-kinetics400-rgb.py rename to configs/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb.py index 05feb2710a..2b28285635 100644 --- a/configs/recognition/omnisource/slowonly_r50_16xb16-8x8x1-256e_imagenet-kinetics400-rgb.py +++ b/configs/recognition/omnisource/slowonly_r50_8xb16-8x8x1-256e_imagenet-kinetics400-rgb.py @@ -159,7 +159,7 @@ convert_to_iter_based=True) ] """ -The learning rate is for total_batch_size = 16 x 16 (num_gpus x batch_size) +The learning rate is for total_batch_size = 8 x 16 (num_gpus x batch_size) If you want to use other batch size or number of GPU settings, please update the learning rate with the linear scaling rule. 
""" diff --git a/configs/recognition/slowfast/metafile.yml b/configs/recognition/slowfast/metafile.yml index 94423659d1..7ba12c0e63 100644 --- a/configs/recognition/slowfast/metafile.yml +++ b/configs/recognition/slowfast/metafile.yml @@ -30,6 +30,8 @@ Models: Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/slowfast/slowfast_r50_8xb8-4x16x1-256e_kinetics400-rgb/slowfast_r50_8xb8-4x16x1-256e_kinetics400-rgb_20220901-701b0f6f.pth - Name: slowfast_r50_8xb8-8x8x1-256e_kinetics400-rgb + Alias: + - slowfast Config: configs/recognition/slowfast/slowfast_r50_8xb8-8x8x1-256e_kinetics400-rgb.py In Collection: SlowFast Metadata: diff --git a/configs/recognition/tsn/metafile.yml b/configs/recognition/tsn/metafile.yml index b4734c93a2..e618ed71cc 100644 --- a/configs/recognition/tsn/metafile.yml +++ b/configs/recognition/tsn/metafile.yml @@ -53,6 +53,8 @@ Models: Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/tsn/tsn_imagenet-pretrained-r50_8xb32-1x1x5-100e_kinetics400-rgb/tsn_imagenet-pretrained-r50_8xb32-1x1x5-100e_kinetics400-rgb_20220906-65d68713.pth - Name: tsn_imagenet-pretrained-r50_8xb32-1x1x8-100e_kinetics400-rgb + Alias: + - TSN Config: configs/recognition/tsn/tsn_imagenet-pretrained-r50_8xb32-1x1x8-100e_kinetics400-rgb.py In Collection: TSN Metadata: diff --git a/configs/recognition/uniformer/README.md b/configs/recognition/uniformer/README.md new file mode 100644 index 0000000000..65c224ecc3 --- /dev/null +++ b/configs/recognition/uniformer/README.md @@ -0,0 +1,67 @@ +# UniFormer + +[UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning](https://arxiv.org/abs/2201.04676) + + + +## Abstract + + + +It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small 3D neighborhood, it lacks the capability to capture global dependency because of the limited receptive field. Alternatively, vision transformers can effectively capture long-range dependency by self-attention mechanism, while having the limitation on reducing local redundancy with blind similarity comparison among all the tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer) which seamlessly integrates merits of 3D convolution and spatiotemporal self-attention in a concise transformer format, and achieves a preferable balance between computation and accuracy. Different from traditional transformers, our relation aggregator can tackle both spatiotemporal redundancy and dependency, by learning local and global token affinity respectively in shallow and deep layers. We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods. For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performances of 60.9% and 71.2% top-1 accuracy respectively. + + + +
+ +
+ +## Results and Models + +### Kinetics-400 + +| frame sampling strategy | resolution | backbone | top1 acc | top5 acc | [reference](<(https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md)>) top1 acc | [reference](<(https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md)>) top5 acc | mm-Kinetics top1 acc | mm-Kinetics top5 acc | testing protocol | FLOPs | params | config | ckpt | +| :---------------------: | :------------: | :---------: | :------: | :------: | :-----------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------: | :------------------: | :------------------: | :--------------: | :---: | :----: | :-----------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------: | +| 16x4x1 | short-side 320 | UniFormer-S | 80.9 | 94.6 | 80.8 | 94.7 | 80.9 | 94.6 | 4 clips x 1 crop | 41.8G | 21.4M | [config](/configs/recognition/uniformer/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb_20221219-c630a037.pth) | +| 16x4x1 | short-side 320 | UniFormer-B | 82.0 | 95.0 | 82.0 | 95.1 | 82.0 | 95.0 | 4 clips x 1 crop | 96.7G | 49.8M | [config](/configs/recognition/uniformer/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb_20221219-157c2e66.pth) | +| 32x4x1 | short-side 320 | UniFormer-B | 83.1 | 95.3 | 82.9 | 95.4 | 83.0 | 95.3 | 4 clips x 1 crop | 59G | 49.8M | [config](/configs/recognition/uniformer/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb_20221219-b776322c.pth) | + +The models are ported from the repo [UniFormer](https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md) and tested on our data. Currently, we only support the testing of UniFormer models; training will be available soon. + +1. The values in columns named after "reference" are the results of the original repo. +2. The values in `top1/5 acc` are tested on the same data list as the original repo, and the label map is provided by [UniFormer](https://drive.google.com/drive/folders/17VB-XdF3Kfr9ORmnGyXCxTMs86n0L4QL). The videos are available at [Kinetics400](https://pan.baidu.com/s/1t5K0FRz3PGAT-37-3FwAfg) (BaiduYun password: g5kp), which consists of 19787 videos. +3. The values in columns named after "mm-Kinetics" are the testing results on the Kinetics dataset held by MMAction2, which is also used by other models in MMAction2. Due to the differences between various versions of the Kinetics dataset, there is a small gap between `top1/5 acc` and `mm-Kinetics top1/5 acc`. For a fair comparison with other models, we report both results here. Note that we simply report the inference results; since the training set differs between UniFormer and other models, the results are lower than those tested on the author's version. +4.
Since the original models for Kinetics-400/600/700 adopt different [label file](https://drive.google.com/drive/folders/17VB-XdF3Kfr9ORmnGyXCxTMs86n0L4QL), we simply map the weight according to the label name. New label map for Kinetics-400/600/700 can be found [here](https://github.com/open-mmlab/mmaction2/tree/dev-1.x/tools/data/kinetics). +5. Due to some difference between [SlowFast](https://github.com/facebookresearch/SlowFast) and MMAction, there are some gaps between their performances. + +For more details on data preparation, you can refer to [preparing_kinetics](/tools/data/kinetics/README.md). + +## Test + +You can use the following command to test a model. + +```shell +python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments] +``` + +Example: test UniFormer-S model on Kinetics-400 dataset and dump the result to a pkl file. + +```shell +python tools/test.py configs/recognition/uniformer/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb.py \ + checkpoints/SOME_CHECKPOINT.pth --dump result.pkl +``` + +For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md). + +## Citation + +```BibTeX +@inproceedings{ + li2022uniformer, + title={UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning}, + author={Kunchang Li and Yali Wang and Gao Peng and Guanglu Song and Yu Liu and Hongsheng Li and Yu Qiao}, + booktitle={International Conference on Learning Representations}, + year={2022}, + url={https://openreview.net/forum?id=nBU_u6DLvoK} +} +``` diff --git a/configs/recognition/uniformer/metafile.yml b/configs/recognition/uniformer/metafile.yml new file mode 100644 index 0000000000..8bcfefe450 --- /dev/null +++ b/configs/recognition/uniformer/metafile.yml @@ -0,0 +1,70 @@ +Collections: +- Name: UniFormer + README: configs/recognition/uniformer/README.md + Paper: + URL: https://arxiv.org/abs/2201.04676 + Title: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning" + +Models: + - Name: uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb + Config: configs/recognition/uniformer/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormer-S + Pretrained: ImageNet-1K + Resolution: short-side 320 + Frame: 16 + Sampling rate: 4 + Modality: RGB + Converted From: + Weights: https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md + Code: https://github.com/Sense-X/UniFormer/tree/main/video_classification + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 80.9 + Top 5 Accuracy: 94.6 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb_20221219-c630a037.pth + + - Name: uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb + Config: configs/recognition/uniformer/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormer-B + Pretrained: ImageNet-1K + Resolution: short-side 320 + Frame: 16 + Sampling rate: 4 + Modality: RGB + Converted From: + Weights: https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md + Code: https://github.com/Sense-X/UniFormer/tree/main/video_classification + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 82.0 + Top 5 Accuracy: 95.0 + Weights: 
https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb_20221219-157c2e66.pth + + - Name: uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb + Config: configs/recognition/uniformer/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormer-B + Pretrained: ImageNet-1K + Resolution: short-side 320 + Frame: 32 + Sampling rate: 4 + Modality: RGB + Converted From: + Weights: https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md + Code: https://github.com/Sense-X/UniFormer/tree/main/video_classification + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 83.1 + Top 5 Accuracy: 95.3 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb_20221219-b776322c.pth diff --git a/configs/recognition/uniformer/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb.py b/configs/recognition/uniformer/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb.py new file mode 100644 index 0000000000..459dc58883 --- /dev/null +++ b/configs/recognition/uniformer/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb.py @@ -0,0 +1,58 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormer', + depth=[5, 8, 20, 7], + embed_dim=[64, 128, 320, 512], + head_dim=64, + drop_path_rate=0.3), + cls_head=dict( + type='I3DHead', + dropout_ratio=0., + num_classes=400, + in_channels=512, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k400' +ann_file_test = 'data/k400/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='SampleFrames', + clip_len=16, + frame_interval=4, + num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='CenterCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=32, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformer/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb.py b/configs/recognition/uniformer/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb.py new file mode 100644 index 0000000000..9f425b6542 --- /dev/null +++ b/configs/recognition/uniformer/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb.py @@ -0,0 +1,58 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormer', + depth=[5, 8, 20, 7], + embed_dim=[64, 128, 320, 512], + head_dim=64, + drop_path_rate=0.3), + cls_head=dict( + type='I3DHead', + dropout_ratio=0., + num_classes=400, + in_channels=512, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 
'VideoDataset' +data_root_val = 'data/k400' +ann_file_test = 'data/k400/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='SampleFrames', + clip_len=32, + frame_interval=4, + num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='CenterCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=16, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformer/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb.py b/configs/recognition/uniformer/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb.py new file mode 100644 index 0000000000..9b4ed546c2 --- /dev/null +++ b/configs/recognition/uniformer/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb.py @@ -0,0 +1,58 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormer', + depth=[3, 4, 8, 3], + embed_dim=[64, 128, 320, 512], + head_dim=64, + drop_path_rate=0.1), + cls_head=dict( + type='I3DHead', + dropout_ratio=0., + num_classes=400, + in_channels=512, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k400' +ann_file_test = 'data/k400/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='SampleFrames', + clip_len=16, + frame_interval=4, + num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='CenterCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=32, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/README.md b/configs/recognition/uniformerv2/README.md new file mode 100644 index 0000000000..c69b69a662 --- /dev/null +++ b/configs/recognition/uniformerv2/README.md @@ -0,0 +1,108 @@ +# UniFormerV2 + +[UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer](https://arxiv.org/abs/2211.09552) + + + +## Abstract + + + +Learning discriminative spatiotemporal representation is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependency with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model has to require a tiresome and complicated image-pretraining phrase, before being finetuned on videos. 
This blocks its wide usage in practice. On the contrary, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks, by arming the pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block. But it contains brand-new local and global relation aggregators, which allow for preferable accuracy-computation balance by seamlessly integrating advantages from both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 gets the state-of-the-art recognition performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 and Moments in Time, temporal-related Something-Something V1/V2, untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400, to our best knowledge. + + + +
+ +
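As a quick orientation before the result tables, here is a minimal sketch (not part of this PR) of how one of the ported checkpoints could be sanity-checked with MMAction2's high-level recognition API (`init_recognizer` / `inference_recognizer` from `mmaction.apis`). The config path and checkpoint URL are taken from this PR; the demo video path and device string are assumptions you may need to adapt.

```python
# Minimal sketch, assuming MMAction2 dev-1.x is installed and this is run from
# the repo root; 'demo/demo.mp4' and 'cuda:0' are placeholders to adapt.
from mmaction.apis import inference_recognizer, init_recognizer

config_file = ('configs/recognition/uniformerv2/'
               'uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py')
checkpoint = ('https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/'
              'kinetics400/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_'
              'kinetics400-rgb_20221219-203d6aac.pth')

# Build the recognizer and load the ported weights (the URL is fetched automatically).
model = init_recognizer(config_file, checkpoint, device='cuda:0')

# Run recognition on a single video; the result is an ActionDataSample with predicted scores.
result = inference_recognizer(model, 'demo/demo.mp4')
print(result)
```

For full evaluation on the Kinetics-400 validation list, use `tools/test.py` as described in the Test section below.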
+ +## Results and Models + +### Kinetics-400 + +| uniform sampling | resolution | backbone | top1 acc | top5 acc | [reference](<(https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md)>) top1 acc | [reference](<(https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md)>) top5 acc | mm-Kinetics top1 acc | mm-Kinetics top5 acc | testing protocol | FLOPs | params | config | ckpt | +| :--------------: | :------------: | :------------------: | :------: | :------: | :---------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------: | :------------------: | :------------------: | :--------------: | :---: | :----: | :-----------------------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| 8 | short-side 320 | UniFormerV2-B/16 | 85.8 | 97.1 | 85.6 | 97.0 | 85.8 | 97.1 | 4 clips x 3 crop | 0.1T | 115M | [config](/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics400/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics400-rgb_20221219-203d6aac.pth) | +| 8 | short-side 320 | UniFormerV2-L/14 | 88.7 | 98.1 | 88.8 | 98.1 | 88.7 | 98.1 | 4 clips x 3 crop | 0.7T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics400/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics400-rgb_20221219-972ea063.pth) | +| 16 | short-side 320 | UniFormerV2-L/14 | 89.0 | 98.2 | 89.1 | 98.2 | 89.0 | 98.2 | 4 clips x 3 crop | 1.3T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics400/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics400-rgb_20221219-6dc86d05.pth) | +| 32 | short-side 320 | UniFormerV2-L/14 | 89.3 | 98.2 | 89.3 | 98.2 | 89.4 | 98.2 | 2 clips x 3 crop | 2.7T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics400/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics400-rgb_20221219-56a46f64.pth) | +| 32 | short-side 320 | UniFormerV2-L/14@336 | 89.5 | 98.4 | 89.7 | 98.3 | 89.5 | 98.4 | 2 clips x 3 crop | 6.3T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics400/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics400-rgb_20221219-1dd7650f.pth) | + +### Kinetics-600 + +| uniform sampling | resolution | backbone | top1 acc | top5 acc | [reference](<(https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md)>) top1 acc | [reference](<(https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md)>) top5 acc | mm-Kinetics top1 acc | mm-Kinetics top5 acc | testing protocol | FLOPs | params | config | ckpt | +| :--------------: | :--------: | 
:------------------: | :------: | :------: | :---------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------: | :------------------: | :------------------: | :--------------: | :---: | :----: | :-----------------------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| 8 | Raw | UniFormerV2-B/16 | 86.4 | 97.3 | 86.1 | 97.2 | 85.5 | 97.0 | 4 clips x 3 crop | 0.1T | 115M | [config](/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics600-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics600/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics600-rgb_20221219-c62c4da4.pth) | +| 8 | Raw | UniFormerV2-L/14 | 89.0 | 98.3 | 89.0 | 98.2 | 87.5 | 98.0 | 4 clips x 3 crop | 0.7T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics600-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics600/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics600-rgb_20221219-cf88e4c2.pth) | +| 16 | Raw | UniFormerV2-L/14 | 89.4 | 98.3 | 89.4 | 98.3 | 87.8 | 98.0 | 4 clips x 3 crop | 1.3T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics600-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics600/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics600-rgb_20221219-38ff0e3e.pth) | +| 32 | Raw | UniFormerV2-L/14 | 89.2 | 98.3 | 89.5 | 98.3 | 87.7 | 98.1 | 2 clips x 3 crop | 2.7T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics600-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics600/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics600-rgb_20221219-d450d071.pth) | +| 32 | Raw | UniFormerV2-L/14@336 | 89.8 | 98.5 | 89.9 | 98.5 | 88.8 | 98.3 | 2 clips x 3 crop | 6.3T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics600-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics600/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics600-rgb_20221219-f984f5d2.pth) | + +### Kinetics-700 + +| uniform sampling | resolution | backbone | top1 acc | top5 acc | [reference](<(https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md)>) top1 acc | [reference](<(https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md)>) top5 acc | mm-Kinetics top1 acc | mm-Kinetics top5 acc | testing protocol | FLOPs | params | config | ckpt | +| :--------------: | :--------: | :------------------: | :------: | :------: | :---------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------: | :------------------: | :------------------: | :--------------: | :---: | :----: | :-----------------------------------------------------------------------------------------------------------------: | 
:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| 8 | Raw | UniFormerV2-B/16 | 76.3 | 92.9 | 76.3 | 92.7 | 75.1 | 92.5 | 4 clips x 3 crop | 0.1T | 115M | [config](/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics700/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics700-rgb_20221219-8a7c4ac4.pth) | +| 8 | Raw | UniFormerV2-L/14 | 80.8 | 95.2 | 80.8 | 95.4 | 79.4 | 94.8 | 4 clips x 3 crop | 0.7T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics700/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics700-rgb_20221219-bfb9f401.pth) | +| 16 | Raw | UniFormerV2-L/14 | 81.2 | 95.6 | 81.2 | 95.6 | 79.2 | 95.0 | 4 clips x 3 crop | 1.3T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics700-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics700/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics700-rgb_20221219-745209d2.pth) | +| 32 | Raw | UniFormerV2-L/14 | 81.4 | 95.7 | 81.5 | 95.7 | 79.8 | 95.3 | 2 clips x 3 crop | 2.7T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics700-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics700/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics700-rgb_20221219-eebe7056.pth) | +| 32 | Raw | UniFormerV2-L/14@336 | 82.1 | 96.0 | 82.1 | 96.1 | 80.6 | 95.6 | 2 clips x 3 crop | 6.3T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics700-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics700/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics700-rgb_20221219-95cf9046.pth) | + +### MiTv1 + +| uniform sampling | resolution | backbone | top1 acc | top5 acc | [reference](<(https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md)>) top1 acc | [reference](<(https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md)>) top5 acc | testing protocol | FLOPs | params | config | ckpt | +| :--------------: | :--------: | :------------------: | :------: | :------: | :---------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------: | :--------------: | :---: | :----: | :------------------------------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| 8 | Raw | UniFormerV2-B/16 | 42.7 | 71.6 | 42.6 | 71.7 | 4 clips x 3 crop | 0.1T | 115M | [config](/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/mitv1/uniformerv2-base-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb_20221219-fddbc786.pth) | +| 
8 | Raw | UniFormerV2-L/14 | 47.0 | 76.1 | 47.0 | 76.1 | 4 clips x 3 crop | 0.7T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/mitv1/uniformerv2-large-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb_20221219-882c0598.pth) | +| 8 | Raw | UniFormerV2-L/14@336 | 47.7 | 76.8 | 47.8 | 76.0 | 4 clips x 3 crop | 1.6T | 354M | [config](/configs/recognition/uniformerv2/uniformerv2-large-p16-res336_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/mitv1/uniformerv2-large-p16-res336_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb_20221219-9020986e.pth) | + +### Kinetics-710 + +| uniform sampling | resolution | backbone | config | ckpt | +| :--------------: | :--------: | :------------------: | :----------------------------------------------------------------------------: | :--------------------------------------------------------------------------: | +| 8 | Raw | UniFormerV2-B/16 | [config](/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-pre_u8_kinetics710-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics710/uniformerv2-base-p16-res224_clip-pre_u8_kinetics710-rgb_20221219-77d34f81.pth) | +| 8 | Raw | UniFormerV2-L/14 | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-pre_u8_kinetics710-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics710/uniformerv2-large-p14-res224_clip-pre_u8_kinetics710-rgb_20221219-bfaae587.pth) | +| 8 | Raw | UniFormerV2-L/14@336 | [config](/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-pre_u8_kinetics710-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics710/uniformerv2-large-p14-res336_clip-pre_u8_kinetics710-rgb_20221219-55878cdc.pth) | + +The models are ported from the repo [UniFormerV2](https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md) and tested on our data. Currently, we only support testing UniFormerV2 models; training support will be available soon. + +1. The values in columns named after "reference" are the results of the original repo. +2. The values in `top1/5 acc` are tested on the same data list as the original repo, and the label map is provided by [UniFormerV2](https://drive.google.com/drive/folders/17VB-XdF3Kfr9ORmnGyXCxTMs86n0L4QL). +3. The values in columns named after "mm-Kinetics" are the testing results on the Kinetics dataset held by MMAction2, which is also used by other models in MMAction2. Due to the differences between various versions of the Kinetics dataset, there is a small gap between `top1/5 acc` and `mm-Kinetics top1/5 acc`. For a fair comparison with other models, we report both results here. Note that we simply report the inference results; since the training set differs between UniFormerV2 and other models, the results are lower than those tested on the authors' version. +4. Since the original models for Kinetics-400/600/700 adopt different [label files](https://drive.google.com/drive/folders/17VB-XdF3Kfr9ORmnGyXCxTMs86n0L4QL), we simply map the weights according to the label names. The new label maps for Kinetics-400/600/700 can be found [here](https://github.com/open-mmlab/mmaction2/tree/dev-1.x/tools/data/kinetics). +5. 
Due to some differences between [SlowFast](https://github.com/facebookresearch/SlowFast) and MMAction2, there are some gaps between their performances. +6. Kinetics-710 is used for pretraining, which helps improve the performance on other datasets efficiently. You can find more details in the [paper](https://arxiv.org/abs/2211.09552). + +For more details on data preparation, you can refer to + +- [preparing_kinetics](/tools/data/kinetics/README.md) +- [preparing_mit](/tools/data/mit/README.md) + +## Test + +You can use the following command to test a model. + +```shell +python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments] +``` + +Example: test UniFormerV2-B/16 model on Kinetics-400 dataset and dump the result to a pkl file. + +```shell +python tools/test.py configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py \ + checkpoints/SOME_CHECKPOINT.pth --dump result.pkl +``` + +For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md). + +## Citation + +```BibTeX +@article{Li2022UniFormerV2SL, + title={UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer}, + author={Kunchang Li and Yali Wang and Yinan He and Yizhuo Li and Yi Wang and Limin Wang and Y. Qiao}, + journal={ArXiv}, + year={2022}, + volume={abs/2211.09552} +} +``` diff --git a/configs/recognition/uniformerv2/metafile.yml b/configs/recognition/uniformerv2/metafile.yml new file mode 100644 index 0000000000..acd35d3443 --- /dev/null +++ b/configs/recognition/uniformerv2/metafile.yml @@ -0,0 +1,414 @@ +Collections: +- Name: UniFormerV2 + README: configs/recognition/uniformerv2/README.md + Paper: + URL: https://arxiv.org/abs/2211.09552 + Title: "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer" + +Models: + - Name: uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics400-rgb + Config: configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-B/16 + Pretrained: Kinetics-710 + Resolution: short-side 320 + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 85.8 + Top 5 Accuracy: 97.1 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics400/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics400-rgb_20221219-203d6aac.pth + + - Name: uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics400-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: Kinetics-710 + Resolution: short-side 320 + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 88.7 + Top 5 Accuracy: 98.1 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics400/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics400-rgb_20221219-972ea063.pth + + - Name: 
uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics400-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics400-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: Kinetics-710 + Resolution: short-side 320 + Frame: 16 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 89.0 + Top 5 Accuracy: 98.2 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics400/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics400-rgb_20221219-6dc86d05.pth + + - Name: uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics400-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics400-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: Kinetics-710 + Resolution: short-side 320 + Frame: 32 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 89.3 + Top 5 Accuracy: 98.2 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics400/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics400-rgb_20221219-56a46f64.pth + + - Name: uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics400-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics400-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14@336 + Pretrained: Kinetics-710 + Resolution: short-side 320 + Frame: 32 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 89.5 + Top 5 Accuracy: 98.4 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics400/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics400-rgb_20221219-1dd7650f.pth + + - Name: uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics600-rgb + Config: configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics600-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-B/16 + Pretrained: Kinetics-710 + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-600 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 86.4 + Top 5 Accuracy: 97.3 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics600/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics600-rgb_20221219-c62c4da4.pth + + - Name: uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics600-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics600-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: 
Kinetics-710 + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-600 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 89.0 + Top 5 Accuracy: 98.3 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics600/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics600-rgb_20221219-cf88e4c2.pth + + - Name: uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics600-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics600-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: Kinetics-710 + Frame: 16 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-600 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 89.4 + Top 5 Accuracy: 98.3 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics600/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics600-rgb_20221219-38ff0e3e.pth + + - Name: uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics600-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics600-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: Kinetics-710 + Frame: 32 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-600 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 89.2 + Top 5 Accuracy: 98.3 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics600/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics600-rgb_20221219-d450d071.pth + + - Name: uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics600-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics600-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14@336 + Pretrained: Kinetics-710 + Frame: 32 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-600 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 89.8 + Top 5 Accuracy: 98.5 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics600/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics600-rgb_20221219-f984f5d2.pth + + - Name: uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics700-rgb + Config: configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-B/16 + Pretrained: Kinetics-710 + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-700 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 76.3 + Top 5 Accuracy: 92.9 + Weights: 
https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics700/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics700-rgb_20221219-8a7c4ac4.pth + + - Name: uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics700-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: Kinetics-710 + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-700 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 80.8 + Top 5 Accuracy: 95.2 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics700/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics700-rgb_20221219-bfb9f401.pth + + - Name: uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics700-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics700-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: Kinetics-710 + Frame: 16 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-700 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 81.2 + Top 5 Accuracy: 95.6 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics700/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics700-rgb_20221219-745209d2.pth + + - Name: uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics700-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics700-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: Kinetics-710 + Frame: 32 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-700 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 81.4 + Top 5 Accuracy: 95.7 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics700/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics700-rgb_20221219-eebe7056.pth + + - Name: uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics700-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics700-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14@336 + Pretrained: Kinetics-710 + Frame: 32 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Kinetics-400 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 82.1 + Top 5 Accuracy: 96.0 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics700/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics700-rgb_20221219-bfb9f401.pth + + - Name: uniformerv2-base-p16-res224_clip-pre_u8_kinetics710-rgb + Config: configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-pre_u8_kinetics710-rgb.py + In Collection: 
UniFormer + Metadata: + Architecture: UniFormerV2-B/16 + Pretrained: CLIP-400M + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics710/uniformerv2-base-p16-res224_clip-pre_u8_kinetics710-rgb_20221219-77d34f81.pth + + - Name: uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics700-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: CLIP-400M + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics710/uniformerv2-large-p14-res224_clip-pre_u8_kinetics710-rgb_20221219-bfaae587.pth + + - Name: uniformerv2-large-p14-res336_clip-pre_u8_kinetics710-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-pre_u8_kinetics710-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14@336 + Pretrained: Kinetics-710 + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/kinetics710/uniformerv2-large-p14-res336_clip-pre_u8_kinetics710-rgb_20221219-55878cdc.pth + + - Name: uniformerv2-base-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb + Config: configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-B/16 + Pretrained: Kinetics-710 + Kinetics-400 + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Moments in Time V1 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 42.7 + Top 5 Accuracy: 71.6 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/mitv1/uniformerv2-base-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb_20221219-fddbc786.pth + + - Name: uniformerv2-large-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb + Config: configs/recognition/uniformerv2/uniformerv2-large-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14 + Pretrained: Kinetics-710 + Kinetics-400 + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Moments in Time V1 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 47.0 + Top 5 Accuracy: 76.1 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/mitv1/uniformerv2-large-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb_20221219-882c0598.pth + + - Name: uniformerv2-large-p16-res336_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb + Config: 
configs/recognition/uniformerv2/uniformerv2-large-p16-res336_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py + In Collection: UniFormer + Metadata: + Architecture: UniFormerV2-L/14@336 + Pretrained: Kinetics-710 + Kinetics-400 + Frame: 8 + Sampling method: Uniform + Modality: RGB + Converted From: + Weights: https://github.com/OpenGVLab/UniFormerV2/blob/main/MODEL_ZOO.md + Code: https://github.com/OpenGVLab/UniFormerV2 + Results: + - Dataset: Moments in Time V1 + Task: Action Recognition + Metrics: + Top 1 Accuracy: 47.7 + Top 5 Accuracy: 76.8 + Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv2/mitv1/uniformerv2-large-p16-res336_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb_20221219-9020986e.pth diff --git a/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py b/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py new file mode 100644 index 0000000000..a4cae65831 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=16, + width=768, + layers=12, + heads=12, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[8, 9, 10, 11], + n_layers=4, + n_dim=768, + n_head=12, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=339, + in_channels=768, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/mit_v1' +ann_file_test = 'data/mit_v1/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=32, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=' ')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py b/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py new file mode 100644 index 0000000000..a3eddb0d04 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=16, + width=768, + layers=12, + heads=12, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, 
+ no_lmhra=True, + double_lmhra=True, + return_list=[8, 9, 10, 11], + n_layers=4, + n_dim=768, + n_head=12, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=400, + in_channels=768, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k400' +ann_file_test = 'data/k400/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=32, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics600-rgb.py b/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics600-rgb.py new file mode 100644 index 0000000000..4c91589dbb --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics600-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=16, + width=768, + layers=12, + heads=12, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[8, 9, 10, 11], + n_layers=4, + n_dim=768, + n_head=12, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=600, + in_channels=768, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k600' +ann_file_test = 'data/k600/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=32, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py b/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py 
new file mode 100644 index 0000000000..92494df5d7 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=16, + width=768, + layers=12, + heads=12, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[8, 9, 10, 11], + n_layers=4, + n_dim=768, + n_head=12, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=700, + in_channels=768, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k700' +ann_file_test = 'data/k700/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=32, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-pre_u8_kinetics710-rgb.py b/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-pre_u8_kinetics710-rgb.py new file mode 100644 index 0000000000..7d055c4fb4 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-base-p16-res224_clip-pre_u8_kinetics710-rgb.py @@ -0,0 +1,37 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=16, + width=768, + layers=12, + heads=12, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[8, 9, 10, 11], + n_layers=4, + n_dim=768, + n_head=12, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=710, + in_channels=768, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics400-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics400-rgb.py new file mode 100644 index 0000000000..5f21a078f8 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics400-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 16 +model = dict( + type='Recognizer3D', + backbone=dict( + 
type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=400, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k400' +ann_file_test = 'data/k400/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=16, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics600-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics600-rgb.py new file mode 100644 index 0000000000..284c313e3d --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics600-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 16 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=600, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k600' +ann_file_test = 'data/k600/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=16, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff 
--git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics700-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics700-rgb.py new file mode 100644 index 0000000000..f137564572 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u16_kinetics700-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 16 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=700, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k700' +ann_file_test = 'data/k700/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=16, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics400-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics400-rgb.py new file mode 100644 index 0000000000..94b92cf99e --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics400-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 32 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=400, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k400' +ann_file_test = 'data/k400/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + 
dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=16, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics600-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics600-rgb.py new file mode 100644 index 0000000000..7a7ba254df --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics600-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 32 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=600, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k600' +ann_file_test = 'data/k600/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=16, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics700-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics700-rgb.py new file mode 100644 index 0000000000..abf8ff5f06 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u32_kinetics700-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 32 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + 
cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=700, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k700' +ann_file_test = 'data/k700/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=16, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py new file mode 100644 index 0000000000..751a1cc7a8 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics400-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=400, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k400' +ann_file_test = 'data/k400/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=32, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics600-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics600-rgb.py new file mode 100644 index 0000000000..ea6eea9a9a --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics600-rgb.py @@ 
-0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=600, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k600' +ann_file_test = 'data/k600/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=32, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py new file mode 100644 index 0000000000..b68593afa3 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-kinetics710-pre_u8_kinetics700-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=700, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k700' +ann_file_test = 'data/k700/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=32, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + 
data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-pre_u8_kinetics710-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-pre_u8_kinetics710-rgb.py new file mode 100644 index 0000000000..46a60758d8 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res224_clip-pre_u8_kinetics710-rgb.py @@ -0,0 +1,37 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=710, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics400-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics400-rgb.py new file mode 100644 index 0000000000..5385c2aa07 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics400-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 32 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=336, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=400, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k400' +ann_file_test = 'data/k400/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=2, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 336)), + dict(type='ThreeCrop', crop_size=336), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=4, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics600-rgb.py 
b/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics600-rgb.py new file mode 100644 index 0000000000..3e495771bc --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics600-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 32 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=336, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=600, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k600' +ann_file_test = 'data/k600/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=2, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 336)), + dict(type='ThreeCrop', crop_size=336), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=4, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics700-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics700-rgb.py new file mode 100644 index 0000000000..9a09934ca0 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-kinetics710-pre_u32_kinetics700-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 32 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=336, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=700, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/k700' +ann_file_test = 'data/k700/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=2, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 336)), + dict(type='ThreeCrop', crop_size=336), + dict(type='FormatShape', 
input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=4, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=',')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-pre_u8_kinetics710-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-pre_u8_kinetics710-rgb.py new file mode 100644 index 0000000000..e47b8a7148 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p14-res336_clip-pre_u8_kinetics710-rgb.py @@ -0,0 +1,37 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 32 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=336, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=710, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py new file mode 100644 index 0000000000..19af3d1eac --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p16-res224_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=224, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=339, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/mit_v1' +ann_file_test = 'data/mit_v1/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 224)), + dict(type='ThreeCrop', crop_size=224), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=32, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + 
data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=' ')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/configs/recognition/uniformerv2/uniformerv2-large-p16-res336_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py b/configs/recognition/uniformerv2/uniformerv2-large-p16-res336_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py new file mode 100644 index 0000000000..7842bf1164 --- /dev/null +++ b/configs/recognition/uniformerv2/uniformerv2-large-p16-res336_clip-kinetics710-kinetics-k400-pre_u8_mitv1-rgb.py @@ -0,0 +1,70 @@ +_base_ = ['../../configs/_base_/default_runtime.py'] + +# model settings +num_frames = 8 +model = dict( + type='Recognizer3D', + backbone=dict( + type='UniFormerV2', + input_resolution=336, + patch_size=14, + width=1024, + layers=24, + heads=16, + t_size=num_frames, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[20, 21, 22, 23], + n_layers=4, + n_dim=1024, + n_head=16, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]), + cls_head=dict( + type='TimeSformerHead', + dropout_ratio=0.5, + num_classes=339, + in_channels=1024, + average_clips='prob'), + data_preprocessor=dict( + type='ActionDataPreprocessor', + mean=[114.75, 114.75, 114.75], + std=[57.375, 57.375, 57.375], + format_shape='NCTHW')) + +# dataset settings +dataset_type = 'VideoDataset' +data_root_val = 'data/mit_v1' +ann_file_test = 'data/mit_v1/val.csv' + +test_pipeline = [ + dict(type='DecordInit'), + dict( + type='UniformSample', clip_len=num_frames, num_clips=4, + test_mode=True), + dict(type='DecordDecode'), + dict(type='Resize', scale=(-1, 336)), + dict(type='ThreeCrop', crop_size=336), + dict(type='FormatShape', input_format='NCTHW'), + dict(type='PackActionInputs') +] + +test_dataloader = dict( + batch_size=8, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_test, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True, + delimiter=' ')) + +test_evaluator = dict(type='AccMetric') +test_cfg = dict(type='TestLoop') diff --git a/demo/README.md b/demo/README.md index 88e4c96bf8..f3f4ba1db9 100644 --- a/demo/README.md +++ b/demo/README.md @@ -7,6 +7,8 @@ - [Video GradCAM Demo](#video-gradcam-demo): A demo script to visualize GradCAM results using a single video. - [Webcam demo](#webcam-demo): A demo script to implement real-time action recognition from a web camera. - [Skeleton-based Action Recognition Demo](#skeleton-based-action-recognition-demo): A demo script to predict the skeleton-based action recognition result using a single video. +- [SpatioTemporal Action Detection Video Demo](#spatiotemporal-action-detection-video-demo): A demo script to predict the spatiotemporal action detection result using a single video. +- [Inferencer Demo](#inferencer): A demo script to implement fast predict for video analysis tasks based on unified inferencer interface. ## Modify configs through script arguments @@ -52,13 +54,13 @@ Optional arguments: Examples: Assume that you are located at `$MMACTION2` and have already downloaded the checkpoints to the directory `checkpoints/`, -or use checkpoint url from to directly load corresponding checkpoint, which will be automatically saved in `$HOME/.cache/torch/checkpoints`. 
+or use checkpoint url from `configs/` to directly load corresponding checkpoint, which will be automatically saved in `$HOME/.cache/torch/checkpoints`. 1. Recognize a video file as input by using a TSN model on cuda by default. ```shell # The demo.mp4 and label_map_k400.txt are both from Kinetics-400 - python demo/demo.py configs/recognition/tsn/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb.py \ + python demo/demo.py demo/demo_configs/tsn_r50_1x1x8_video_infer.py \ checkpoints/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb_20220818-2692d16c.pth \ demo/demo.mp4 tools/data/kinetics/label_map_k400.txt ``` @@ -67,7 +69,7 @@ or use checkpoint url from to directly load corresponding checkpoint, which will ```shell # The demo.mp4 and label_map_k400.txt are both from Kinetics-400 - python demo/demo.py configs/recognition/tsn/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb.py \ + python demo/demo.py demo/demo_configs/tsn_r50_1x1x8_video_infer.py \ https://download.openmmlab.com/mmaction/v1.0/recognition/tsn/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb_20220818-2692d16c.pth \ demo/demo.mp4 tools/data/kinetics/label_map_k400.txt ``` @@ -76,7 +78,7 @@ or use checkpoint url from to directly load corresponding checkpoint, which will ```shell # The demo.mp4 and label_map_k400.txt are both from Kinetics-400 - python demo/demo.py configs/recognition/tsn/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb.py \ + python demo/demo.py demo/demo_configs/tsn_r50_1x1x8_video_infer.py \ checkpoints/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb_20220818-2692d16c.pth \ demo/demo.mp4 tools/data/kinetics/label_map_k400.txt --out-filename demo/demo_out.mp4 ``` @@ -86,7 +88,7 @@ or use checkpoint url from to directly load corresponding checkpoint, which will MMAction2 provides a demo script to visualize GradCAM results using a single video. ```shell -python demo/demo_gradcam.py ${CONFIG_FILE} ${CHECKPOINT_FILE} ${VIDEO_FILE} [--use-frames] \ +python tools/visualizations/vis_cam.py ${CONFIG_FILE} ${CHECKPOINT_FILE} ${VIDEO_FILE} [--use-frames] \ [--device ${DEVICE_TYPE}] [--target-layer-name ${TARGET_LAYER_NAME}] [--fps {FPS}] \ [--target-resolution ${TARGET_RESOLUTION}] [--resize-algorithm {RESIZE_ALGORITHM}] [--out-filename {OUT_FILE}] ``` @@ -107,7 +109,7 @@ or use checkpoint url from `configs/` to directly load corresponding checkpoint, 1. Get GradCAM results of a I3D model, using a video file as input and then generate an gif file with 10 fps. ```shell - python demo/demo_gradcam.py demo/demo_configs/i3d_r50_32x2x1_video_infer.py \ + python tools/visualizations/vis_cam.py demo/demo_configs/i3d_r50_32x2x1_video_infer.py \ checkpoints/i3d_imagenet-pretrained-r50_8xb8-32x2x1-100e_kinetics400-rgb_20220812-e213c223.pth demo/demo.mp4 \ --target-layer-name backbone/layer4/1/relu --fps 10 \ --out-filename demo/demo_gradcam.gif @@ -116,7 +118,7 @@ or use checkpoint url from `configs/` to directly load corresponding checkpoint, 2. Get GradCAM results of a TSN model, using a video file as input and then generate an gif file, loading checkpoint from url. 
```shell - python demo/demo_gradcam.py demo/demo_configs/tsn_r50_1x1x8_video_infer.py \ + python tools/visualizations/vis_cam.py demo/demo_configs/tsn_r50_1x1x8_video_infer.py \ https://download.openmmlab.com/mmaction/v1.0/recognition/tsn/tsn_imagenet-pretrained-r50_8xb32-dense-1x1x5-100e_kinetics400-rgb/tsn_imagenet-pretrained-r50_8xb32-dense-1x1x5-100e_kinetics400-rgb_20220906-dcbc6e01.pth \ demo/demo.mp4 --target-layer-name backbone/layer4/1/relu --out-filename demo/demo_gradcam_tsn.gif ``` @@ -183,7 +185,7 @@ Users can change: ## Skeleton-based Action Recognition Demo -MMAction2 provides an demo script to predict the skeleton-based action recognition result using a single video. +MMAction2 provides a demo script to predict the skeleton-based action recognition result using a single video. ```shell python demo/demo_skeleton.py ${VIDEO_FILE} ${OUT_FILENAME} \ @@ -247,3 +249,122 @@ python demo/demo_skeleton.py demo/demo_skeleton.mp4 demo/demo_skeleton_out.mp4 \ --pose-checkpoint https://download.openmmlab.com/mmpose/top_down/hrnet/hrnet_w32_coco_256x192-c78dce93_20200708.pth \ --label-map tools/data/skeleton/label_map_ntu60.txt ``` + +## SpatioTemporal Action Detection Video Demo + +MMAction2 provides a demo script to predict the SpatioTemporal Action Detection result using a single video. + +```shell +python demo/demo_spatiotemporal_det.py --video ${VIDEO_FILE} \ + [--out-filename ${OUTPUT_FILENAME}] \ + [--config ${SPATIOTEMPORAL_ACTION_DETECTION_CONFIG_FILE}] \ + [--checkpoint ${SPATIOTEMPORAL_ACTION_DETECTION_CHECKPOINT}] \ + [--det-config ${HUMAN_DETECTION_CONFIG_FILE}] \ + [--det-checkpoint ${HUMAN_DETECTION_CHECKPOINT}] \ + [--det-score-thr ${HUMAN_DETECTION_SCORE_THRESHOLD}] \ + [--det-cat-id ${HUMAN_DETECTION_CATEGORY_ID}] \ + [--action-score-thr ${ACTION_DETECTION_SCORE_THRESHOLD}] \ + [--label-map ${LABEL_MAP}] \ + [--device ${DEVICE}] \ + [--short-side] ${SHORT_SIDE} \ + [--predict-stepsize ${PREDICT_STEPSIZE}] \ + [--output-stepsize ${OUTPUT_STEPSIZE}] \ + [--output-fps ${OUTPUT_FPS}] +``` + +Optional arguments: + +- `OUTPUT_FILENAME`: Path to the output file which is a video format. Defaults to `demo/stdet_demo.mp4`. +- `SPATIOTEMPORAL_ACTION_DETECTION_CONFIG_FILE`: The spatiotemporal action detection config file path. +- `SPATIOTEMPORAL_ACTION_DETECTION_CHECKPOINT`: The spatiotemporal action detection checkpoint URL. +- `HUMAN_DETECTION_CONFIG_FILE`: The human detection config file path. +- `HUMAN_DETECTION_CHECKPOINT`: The human detection checkpoint URL. +- `HUMAN_DETECTION_SCORE_THRESHOLD`: The score threshold for human detection. Defaults to 0.9. +- `HUMAN_DETECTION_CATEGORY_ID`: The category id for human detection. Defaults to 0. +- `ACTION_DETECTION_SCORE_THRESHOLD`: The score threshold for action detection. Defaults to 0.5. +- `LABEL_MAP`: The label map used. Defaults to `tools/data/ava/label_map.txt`. +- `DEVICE`: Type of device to run the demo. Allowed values are cuda device like `cuda:0` or `cpu`. Defaults to `cuda:0`. +- `SHORT_SIDE`: The short side used for frame extraction. Defaults to 256. +- `PREDICT_STEPSIZE`: Make a prediction per N frames. Defaults to 8. +- `OUTPUT_STEPSIZE`: Output 1 frame per N frames in the input video. Note that `PREDICT_STEPSIZE % OUTPUT_STEPSIZE == 0`. Defaults to 4. +- `OUTPUT_FPS`: The FPS of demo video output. Defaults to 6. + +Examples: + +Assume that you are located at `$MMACTION2` . + +1. Use the Faster RCNN as the human detector, SlowOnly-8x8-R101 as the action detector. 
Predictions are made every 8 frames, and 1 of every 4 frames is written to the output video, which is saved at 6 FPS.
+
+```shell
+python demo/demo_spatiotemporal_det.py demo/demo.mp4 demo/demo_spatiotemporal_det.mp4 \
+    --config configs/detection/ava/slowonly_kinetics400-pretrained-r101_8xb16-8x8x1-20e_ava21-rgb.py \
+    --checkpoint https://download.openmmlab.com/mmaction/detection/ava/slowonly_omnisource_pretrained_r101_8x8x1_20e_ava_rgb/slowonly_omnisource_pretrained_r101_8x8x1_20e_ava_rgb_20201217-16378594.pth \
+    --det-config demo/demo_configs/faster-rcnn_r50_fpn_2x_coco_infer.py \
+    --det-checkpoint http://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_2x_coco/faster_rcnn_r50_fpn_2x_coco_bbox_mAP-0.384_20200504_210434-a5d8aa15.pth \
+    --det-score-thr 0.9 \
+    --action-score-thr 0.5 \
+    --label-map tools/data/ava/label_map.txt \
+    --predict-stepsize 8 \
+    --output-stepsize 4 \
+    --output-fps 6
+```
+
+## Inferencer
+
+MMAction2 provides a demo script for fast prediction on video analysis tasks based on the unified inferencer interface. Currently, only the action recognition task is supported.
+
+```shell
+python demo/demo_inferencer.py ${INPUTS} \
+    [--vid-out-dir ${VID_OUT_DIR}] \
+    [--rec ${RECOG_TASK}] \
+    [--rec-weights ${RECOG_WEIGHTS}] \
+    [--label-file ${LABEL_FILE}] \
+    [--device ${DEVICE_TYPE}] \
+    [--batch-size ${BATCH_SIZE}] \
+    [--print-result ${PRINT_RESULT}] \
+    [--pred-out-file ${PRED_OUT_FILE}]
+```
+
+Optional arguments:
+
+- `--show`: If specified, the demo will display the video in a popup window.
+- `--print-result`: If specified, the demo will print the inference results.
+- `VID_OUT_DIR`: Output directory for saved videos. Defaults to None, meaning videos are not saved.
+- `RECOG_TASK`: The action recognition algorithm to use. It can be a config file path, or a model name or alias defined in the metafile.
+- `RECOG_WEIGHTS`: Path to a custom checkpoint file for the selected recognition model. If it is not specified and `--rec` is a model name from the metafile, the weights are loaded from the metafile.
+- `LABEL_FILE`: Label file of the dataset the algorithm was pretrained on. Defaults to None, meaning labels are not shown in the results.
+- `DEVICE_TYPE`: Type of device to run the demo. Allowed values are cuda devices like `cuda:0` or `cpu`. Defaults to `cuda:0`.
+- `BATCH_SIZE`: The batch size used in inference. Defaults to 1.
+- `PRED_OUT_FILE`: File path for saving the inference results. Defaults to None, meaning prediction results are not saved.
+
+Examples:
+
+Assume that you are located at `$MMACTION2`.
+
+1. Recognize a video file by using a TSN model, loading the checkpoint from the metafile.
+
+   ```shell
+   # The demo.mp4 and label_map_k400.txt are both from Kinetics-400
+   python demo/demo_inferencer.py demo/demo.mp4 \
+       --rec configs/recognition/tsn/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb.py \
+       --label-file tools/data/kinetics/label_map_k400.txt
+   ```
+
+2. Recognize a video file by using a TSN model, using the model alias defined in the metafile.
+
+   ```shell
+   # The demo.mp4 and label_map_k400.txt are both from Kinetics-400
+   python demo/demo_inferencer.py demo/demo.mp4 \
+       --rec tsn \
+       --label-file tools/data/kinetics/label_map_k400.txt
+   ```
+
+3. Recognize a video file by using a TSN model, and then save the visualization video.
+ + ```shell + # The demo.mp4 and label_map_k400.txt are both from Kinetics-400 + python demo/demo_inferencer.py demo/demo.mp4 \ + --vid-out-dir demo_out \ + --rec tsn \ + --label-file tools/data/kinetics/label_map_k400.txt + ``` diff --git a/demo/demo.ipynb b/demo/demo.ipynb index 2f1c764426..ebcf2ff538 100644 --- a/demo/demo.ipynb +++ b/demo/demo.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "code", - "execution_count": 1, + "execution_count": 10, "metadata": { "collapsed": true, "pycharm": { @@ -11,66 +11,58 @@ }, "outputs": [], "source": [ - "from mmaction.apis import init_recognizer, inference_recognizer\n", - "from mmaction.utils import register_all_modules" + "from operator import itemgetter\n", + "from mmaction.apis import init_recognizer, inference_recognizer" ] }, { "cell_type": "code", - "execution_count": 2, - "outputs": [], - "source": [ - "register_all_modules() # register all modules and set mmaction2 as the default scope." - ], + "execution_count": 4, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } - } - }, - { - "cell_type": "code", - "execution_count": 3, + }, "outputs": [], "source": [ - "config_file = '../configs/recognition/tsn/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb.py'\n", + "config_file = '../demo/demo_configs/tsn_r50_1x1x8_video_infer.py'\n", "# download the checkpoint from model zoo and put it in `checkpoints/`\n", "checkpoint_file = '../checkpoints/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb_20220818-2692d16c.pth'" - ], + ] + }, + { + "cell_type": "code", + "execution_count": 5, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } - } - }, - { - "cell_type": "code", - "execution_count": 4, + }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "local loads checkpoint from path: ../checkpoints/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb_20220818-2692d16c.pth\n" + "Loads checkpoint by local backend from path: ../checkpoints/tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb_20220818-2692d16c.pth\n" ] } ], "source": [ "# build the model from a config file and a checkpoint file\n", "model = init_recognizer(config_file, checkpoint_file, device='cpu')" - ], + ] + }, + { + "cell_type": "code", + "execution_count": 11, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } - } - }, - { - "cell_type": "code", - "execution_count": 5, + }, "outputs": [], "source": [ "# test a single video and show the result:\n", @@ -78,30 +70,35 @@ "label = '../tools/data/kinetics/label_map_k400.txt'\n", "results = inference_recognizer(model, video)\n", "\n", + "pred_scores = results.pred_scores.item.tolist()\n", + "score_tuples = tuple(zip(range(len(pred_scores)), pred_scores))\n", + "score_sorted = sorted(score_tuples, key=itemgetter(1), reverse=True)\n", + "top5_label = score_sorted[:5]\n", + "\n", "labels = open(label).readlines()\n", "labels = [x.strip() for x in labels]\n", - "results = [(labels[k[0]], k[1]) for k in results]" - ], + "results = [(labels[k[0]], k[1]) for k in top5_label]" + ] + }, + { + "cell_type": "code", + "execution_count": 12, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } - } - }, - { - "cell_type": "code", - "execution_count": 6, + }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "arm wrestling: 50.61515426635742\n", - "rock scissors paper: 16.606340408325195\n", - "massaging feet: 15.414356231689453\n", - "stretching leg: 13.792497634887695\n", - "bench pressing: 13.432787895202637\n" + "arm wrestling: 1.0\n", + "rock scissors paper: 1.698846019067312e-15\n", + "massaging 
feet: 5.157996544393221e-16\n", + "stretching leg: 1.018867278715779e-16\n", + "bench pressing: 7.110452486439706e-17\n" ] } ], @@ -109,32 +106,31 @@ "# show the results\n", "for result in results:\n", " print(f'{result[0]}: ', result[1])" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } + ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "mmact_dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", - "version": 2 + "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" + "pygments_lexer": "ipython3", + "version": "3.7.13 (default, Mar 29 2022, 02:18:16) \n[GCC 7.5.0]" + }, + "vscode": { + "interpreter": { + "hash": "189c342a4747645665e89db23000ac4d4edb7a87c4cd0b2f881610f468fb778d" + } } }, "nbformat": 4, diff --git a/demo/demo.py b/demo/demo.py index 5cebcd3abe..6c9b5db5a5 100644 --- a/demo/demo.py +++ b/demo/demo.py @@ -4,11 +4,9 @@ from operator import itemgetter from typing import Optional, Tuple -import cv2 from mmengine import Config, DictAction from mmaction.apis import inference_recognizer, init_recognizer -from mmaction.utils import register_all_modules from mmaction.visualization import ActionVisualizer @@ -88,34 +86,9 @@ def get_output( if video_path.startswith(('http://', 'https://')): raise NotImplementedError - try: - import decord - except ImportError: - raise ImportError('Please install decord to enable output file.') - - # Channel Order is `BGR` - video = decord.VideoReader(video_path) - frames = [x.asnumpy()[..., ::-1] for x in video] - if target_resolution: - w, h = target_resolution - frame_h, frame_w, _ = frames[0].shape - if w == -1: - w = int(h / frame_h * frame_w) - if h == -1: - h = int(w / frame_w * frame_h) - frames = [cv2.resize(f, (w, h)) for f in frames] - # init visualizer out_type = 'gif' if osp.splitext(out_filename)[1] == '.gif' else 'video' - vis_backends_cfg = [ - dict( - type='LocalVisBackend', - out_type=out_type, - save_dir='demo', - fps=fps) - ] - visualizer = ActionVisualizer( - vis_backends=vis_backends_cfg, save_dir='place_holder') + visualizer = ActionVisualizer() visualizer.dataset_meta = dict(classes=labels) text_cfg = {'colors': font_color} @@ -124,19 +97,20 @@ def get_output( visualizer.add_datasample( out_filename, - frames, + video_path, data_sample, draw_pred=True, draw_gt=False, - text_cfg=text_cfg) + text_cfg=text_cfg, + fps=fps, + out_type=out_type, + out_path=osp.join('demo', out_filename), + target_resolution=target_resolution) def main(): args = parse_args() - # Register all modules in mmaction2 into the registries - register_all_modules() - cfg = Config.fromfile(args.config) if args.cfg_options is not None: cfg.merge_from_dict(args.cfg_options) diff --git a/demo/demo_inferencer.py b/demo/demo_inferencer.py new file mode 100644 index 0000000000..f7a7f365e9 --- /dev/null +++ b/demo/demo_inferencer.py @@ -0,0 +1,70 @@ +# Copyright (c) OpenMMLab. All rights reserved. 
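+# Usage note (a minimal sketch based on the examples in demo/README.md; the
+# checkpoint is resolved from the metafile when only a model alias such as
+# `tsn` is given):
+#
+#   python demo/demo_inferencer.py demo/demo.mp4 \
+#       --rec tsn \
+#       --label-file tools/data/kinetics/label_map_k400.txt \
+#       --print-result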
+from argparse import ArgumentParser + +from mmaction.apis.inferencers import MMAction2Inferencer + + +def parse_args(): + parser = ArgumentParser() + parser.add_argument( + 'inputs', type=str, help='Input video file or rawframes folder path.') + parser.add_argument( + '--vid-out-dir', + type=str, + default='', + help='Output directory of videos.') + parser.add_argument( + '--rec', + type=str, + default=None, + help='Pretrained action recognition algorithm. It\'s the path to the ' + 'config file or the model name defined in metafile.') + parser.add_argument( + '--rec-weights', + type=str, + default=None, + help='Path to the custom checkpoint file of the selected recog model. ' + 'If it is not specified and "rec" is a model name of metafile, the ' + 'weights will be loaded from metafile.') + parser.add_argument( + '--label-file', type=str, default=None, help='label file for dataset.') + parser.add_argument( + '--device', + type=str, + default=None, + help='Device used for inference. ' + 'If not specified, the available device will be automatically used.') + parser.add_argument( + '--batch-size', type=int, default=1, help='Inference batch size.') + parser.add_argument( + '--show', + action='store_true', + help='Display the video in a popup window.') + parser.add_argument( + '--print-result', + action='store_true', + help='Whether to print the results.') + parser.add_argument( + '--pred-out-file', + type=str, + default='', + help='File to save the inference results.') + + call_args = vars(parser.parse_args()) + + init_kws = ['rec', 'rec_weights', 'device', 'label_file'] + init_args = {} + for init_kw in init_kws: + init_args[init_kw] = call_args.pop(init_kw) + + return init_args, call_args + + +def main(): + init_args, call_args = parse_args() + mmaction2 = MMAction2Inferencer(**init_args) + mmaction2(**call_args) + + +if __name__ == '__main__': + main() diff --git a/demo/demo_skeleton.py b/demo/demo_skeleton.py index 98ce13f1bd..57c84c90a3 100644 --- a/demo/demo_skeleton.py +++ b/demo/demo_skeleton.py @@ -14,7 +14,7 @@ from mmaction.apis import (detection_inference, inference_recognizer, init_recognizer, pose_inference) from mmaction.registry import VISUALIZERS -from mmaction.utils import frame_extract, register_all_modules +from mmaction.utils import frame_extract try: import moviepy.editor as mpy @@ -168,12 +168,8 @@ def main(): fake_anno['keypoint'] = keypoint.transpose((1, 0, 2, 3)) fake_anno['keypoint_score'] = keypoint_score.transpose((1, 0, 2)) - register_all_modules() config = mmengine.Config.fromfile(args.config) config.merge_from_dict(args.cfg_options) - if 'data_preprocessor' in config.model: - config.model.data_preprocessor['mean'] = (w // 2, h // 2, .5) - config.model.data_preprocessor['std'] = (w, h, 1.) 
model = init_recognizer(config, args.checkpoint, args.device) result = inference_recognizer(model, fake_anno) diff --git a/demo/demo_spatiotemporal_det.py b/demo/demo_spatiotemporal_det.py index a8e49c1020..009a9475a6 100644 --- a/demo/demo_spatiotemporal_det.py +++ b/demo/demo_spatiotemporal_det.py @@ -17,7 +17,6 @@ from mmaction.apis import detection_inference from mmaction.registry import MODELS from mmaction.structures import ActionDataSample -from mmaction.utils import register_all_modules try: import moviepy.editor as mpy @@ -179,8 +178,7 @@ def pack_result(human_detection, result, img_h, img_w): def parse_args(): parser = argparse.ArgumentParser(description='MMAction2 demo') parser.add_argument('video', help='video file/url') - parser.add_argument( - 'out_filename', help='output filename', default='demo/stdet_demo.mp4') + parser.add_argument('out_filename', help='output filename') parser.add_argument( '--config', default=('configs/detection/ava/slowonly_kinetics400-pretrained-' @@ -195,7 +193,7 @@ def parse_args(): help='spatialtemporal detection model checkpoint file/url') parser.add_argument( '--det-config', - default='demo/skeleton_demo_cfg/faster-rcnn_r50_fpn_2x_coco_infer.py', + default='demo/demo_configs/faster-rcnn_r50_fpn_2x_coco_infer.py', help='human detection config file path (from mmdet)') parser.add_argument( '--det-checkpoint', @@ -260,7 +258,6 @@ def parse_args(): def main(): args = parse_args() - register_all_modules() frame_paths, original_frames = frame_extraction(args.video) num_frame = len(frame_paths) diff --git a/demo/mmaction2_tutorial.ipynb b/demo/mmaction2_tutorial.ipynb index 98884de80b..f0f5fb9d03 100644 --- a/demo/mmaction2_tutorial.ipynb +++ b/demo/mmaction2_tutorial.ipynb @@ -1,15 +1,15 @@ { "cells": [ { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "view-in-github" - }, - "source": [ - "\"Open" - ] - }, + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "view-in-github" + }, + "source": [ + "\"Open" + ] + }, { "cell_type": "markdown", "metadata": { @@ -47,8 +47,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "nvcc: NVIDIA (R) Cuda compiler driver\n", "Copyright (c) 2005-2020 NVIDIA Corporation\n", @@ -82,8 +82,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Looking in links: https://download.pytorch.org/whl/torch_stable.html\n", @@ -360,8 +360,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "1.9.0+cu111 True\n", "1.0.0rc0\n", @@ -412,8 +412,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "--2022-09-27 02:52:15-- https://download.openmmlab.com/mmaction/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth\n", "Resolving download.openmmlab.com (download.openmmlab.com)... 
161.117.242.67\n", @@ -447,8 +447,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "local loads checkpoint from path: checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth\n" ] @@ -456,10 +456,8 @@ ], "source": [ "from mmaction.apis import inference_recognizer, init_recognizer\n", - "from mmaction.utils import register_all_modules\n", "from mmengine import Config\n", "\n", - "register_all_modules()\n", "\n", "# Choose to use a config and initialize the recognizer\n", "config = 'configs/recognition/tsn/tsn_imagenet-pretrained-r50_8xb32-1x1x3-100e_kinetics400-rgb.py'\n", @@ -506,8 +504,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "The top-5 labels with corresponding scores are:\n", "arm wrestling: 29.61644172668457\n", @@ -563,8 +561,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "rm: cannot remove 'kinetics400_tiny.zip*': No such file or directory\n", "--2022-09-27 02:57:21-- https://download.openmmlab.com/mmaction/kinetics400_tiny.zip\n", @@ -601,8 +599,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "Reading package lists...\n", "Building dependency tree...\n", @@ -693,8 +691,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "D32_1gwq35E.mp4 0\n", "iRuyZSKhHRg.mp4 1\n", @@ -787,8 +785,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "Config:\n", "model = dict(\n", @@ -1063,8 +1061,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "09/27 02:57:47 - mmengine - \u001b[4m\u001b[37mINFO\u001b[0m - \n", "------------------------------------------------------------\n", @@ -1301,29 +1299,29 @@ ] }, { - "output_type": "stream", "name": "stderr", + "output_type": "stream", "text": [ "Downloading: \"https://download.pytorch.org/models/resnet50-11ad3fa6.pth\" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth\n" ] }, { - "output_type": "display_data", "data": { - "text/plain": [ - " 0%| | 0.00/97.8M [00:00 + +- [通过命令行参数修改配置信息](#通过命令行参数修改配置信息) +- [配置文件结构](#配置文件结构) +- [配置文件命名规则](#配置文件命名规则) + - [动作识别的配置文件系统](#动作识别的配置文件系统) + - [时空动作检测的配置文件系统](#时空动作检测的配置文件系统) + - [时序动作检测的配置文件系统](#时序动作检测的配置文件系统) + + + +## 通过命令行参数修改配置信息 + +当用户使用脚本 "tools/train.py" 或者 "tools/test.py" 提交任务时,可以通过指定 `--cfg-options` 参数来直接修改所使用的配置文件内容。 + +- 更新配置文件内的字典 + + 用户可以按照原始配置中的字典键顺序来指定配置文件的设置。 + 例如,`--cfg-options model.backbone.norm_eval=False` 会改变 `train` 模式下模型主干网络 backbone 中所有的 BN 模块。 + +- 更新配置文件内列表的键 + + 配置文件中,存在一些由字典组成的列表。例如,训练数据前处理流水线 data.train.pipeline 就是 python 列表。 + 如,`[dict(type='SampleFrames'), ...]`。如果用户想更改其中的 `'SampleFrames'` 为 `'DenseSampleFrames'`, + 可以指定 `--cfg-options data.train.pipeline.0.type=DenseSampleFrames`。 + +- 更新列表/元组的值。 + + 当配置文件中需要更新的是一个列表或者元组,例如,配置文件通常会设置 `model.data_preprocessor.mean=[123.675, 116.28, 103.53]`,用户如果想更改, + 需要指定 `--cfg-options model.data_preprocessor.mean="[128,128,128]"`。注意这里的引号 " 对于列表/元组数据类型的修改是必要的。 + +## 配置文件结构 + +在 `config/_base_` 文件夹下存在 3 种基本组件类型: 模型(model), 训练策略(schedule), 运行时的默认设置(default_runtime)。 +许多方法都可以方便地通过组合这些组件进行实现,如 TSN,I3D,SlowOnly 等。 +其中,通过 `_base_` 下组件来构建的配置被称为 _原始配置_(_primitive_)。 + +对于在同一文件夹下的所有配置文件,MMAction2 推荐只存在 **一个** 对应的 _原始配置_ 文件。 +所有其他的配置文件都应该继承 _原始配置_ 文件,这样就能保证配置文件的最大继承深度为 3。 + +为了方便理解,MMAction2 推荐用户继承现有方法的配置文件。 +例如,如需修改 TSN 的配置文件,用户应先通过 `_base_ = 
'../tsn/tsn_imagenet-pretrained-r50_8xb32-1x1x3-100e_kinetics400-rgb.py'` 继承 TSN 配置文件的基本结构, +并修改其中必要的内容以完成继承。 + +如果用户想实现一个独立于任何一个现有的方法结构的新方法,则可以在 `configs/TASK` 中建立新的文件夹。 + +更多详细内容,请参考 [mmengine](https://mmengine.readthedocs.io/en/latest/tutorials/config.html)。 + +## 配置文件命名规则 + +MMAction2 按照以下风格进行配置文件命名,代码库的贡献者需要遵循相同的命名规则。配置文件名分为几个部分。逻辑上,不同的部分用下划线 `'_'`连接,同一部分中的设置用破折号 `'-'`连接。 + +``` +{algorithm info}_{module info}_{training info}_{data info}.py +``` + +其中,`{xxx}` 表示必要的命名域,`[yyy]` 表示可选的命名域。 + +- `{algorithm info}`: + - `{model}`: 模型类型,如 `tsn`,`i3d`, `swin`, `vit` 等。 + - `[model setting]`: 一些模型上的特殊设置,如`base`, `p16`, `w877`等。 +- `{module info}`: + - `[pretained info]`: 预训练信息,如 `kinetics400-pretrained`, `in1k-pre`等. + - `{backbone}`: 主干网络类型和预训练信息,如 `r50`(ResNet-50)等。 + - `[backbone setting]`: 对于一些骨干网络的特殊设置,如`nl-dot-product`, `bnfrozen`, `nopool`等。 +- `{training info}`: + - `{gpu x batch_per_gpu]}`: GPU 数量以及每个 GPU 上的采样。 + - `{pipeline setting}`: 采帧数据格式,形如 `dense`, `{clip_len}x{frame_interval}x{num_clips}`, `u48`等。 + - `{schedule}`: 训练策略设置,如 `20e` 表示 20 个周期(epoch)。 +- `{data info}`: + - `{dataset}`:数据集名,如 `kinetics400`,`mmit`等。 + - `{modality}`: 帧的模态,如 `rgb`, `flow`, `keypoint-2d`等。 + +### 动作识别的配置文件系统 + +MMAction2 将模块化设计整合到配置文件系统中,以便执行各类不同实验。 + +- 以 TSN 为例 + + 为了帮助用户理解 MMAction2 的配置文件结构,以及动作识别系统中的一些模块,这里以 TSN 为例,给出其配置文件的注释。 + 对于每个模块的详细用法以及对应参数的选择,请参照 API 文档。 + + ```python + # 模型设置 + model = dict( # 模型的配置 + type='Recognizer2D', # 动作识别器的类型 + backbone=dict( # Backbone 字典设置 + type='ResNet', # Backbone 名 + pretrained='torchvision://resnet50', # 预训练模型的 url 或文件位置 + depth=50, # ResNet 模型深度 + norm_eval=False), # 训练时是否设置 BN 层为验证模式 + cls_head=dict( # 分类器字典设置 + type='TSNHead', # 分类器名 + num_classes=400, # 分类类别数量 + in_channels=2048, # 分类器里输入通道数 + spatial_type='avg', # 空间维度的池化种类 + consensus=dict(type='AvgConsensus', dim=1), # consensus 模块设置 + dropout_ratio=0.4, # dropout 层概率 + init_std=0.01, # 线性层初始化 std 值 + average_clips='prob'), # 平均多个 clip 结果的方法 + data_preprocessor=dict( # 数据预处理器的字典设置 + type='ActionDataPreprocessor', # 数据预处理器名 + mean=[123.675, 116.28, 103.53], # 不同通道归一化所用的平均值 + std=[58.395, 57.12, 57.375], # 不同通道归一化所用的方差 + format_shape='NCHW'), # 最终图像形状格式 + # 模型训练和测试的设置 + train_cfg=None, # 训练 TSN 的超参配置 + test_cfg=None) # 测试 TSN 的超参配置 + + # 数据集设置 + dataset_type = 'RawframeDataset' # 训练,验证,测试的数据集类型 + data_root = 'data/kinetics400/rawframes_train/' # 训练集的根目录 + data_root_val = 'data/kinetics400/rawframes_val/' # 验证集,测试集的根目录 + ann_file_train = 'data/kinetics400/kinetics400_train_list_rawframes.txt' # 训练集的标注文件 + ann_file_val = 'data/kinetics400/kinetics400_val_list_rawframes.txt' # 验证集的标注文件 + ann_file_test = 'data/kinetics400/kinetics400_val_list_rawframes.txt' # 测试集的标注文件 + + train_pipeline = [ # 训练数据前处理流水线步骤组成的列表 + dict( # SampleFrames 类的配置 + type='SampleFrames', # 选定采样哪些视频帧 + clip_len=1, # 每个输出视频片段的帧 + frame_interval=1, # 所采相邻帧的时序间隔 + num_clips=3), # 所采帧片段的数量 + dict( # RawFrameDecode 类的配置 + type='RawFrameDecode'), # 给定帧序列,加载对应帧,解码对应帧 + dict( # Resize 类的配置 + type='Resize', # 调整图片尺寸 + scale=(-1, 256)), # 调整比例 + dict( # MultiScaleCrop 类的配置 + type='MultiScaleCrop', # 多尺寸裁剪,随机从一系列给定尺寸中选择一个比例尺寸进行裁剪 + input_size=224, # 网络输入 + scales=(1, 0.875, 0.75, 0.66), # 长宽比例选择范围 + random_crop=False, # 是否进行随机裁剪 + max_wh_scale_gap=1), # 长宽最大比例间隔 + dict( # Resize 类的配置 + type='Resize', # 调整图片尺寸 + scale=(224, 224), # 调整比例 + keep_ratio=False), # 是否保持长宽比 + dict( # Flip 类的配置 + type='Flip', # 图片翻转 + flip_ratio=0.5), # 执行翻转几率 + dict( # FormatShape 类的配置 + type='FormatShape', # 将图片格式转变为给定的输入格式 + input_format='NCHW'), # 最终的图片组成格式 + dict( # 
PackActionInputs 类的配置 + type='PackActionInputs') # 将输入数据打包 + ] + val_pipeline = [ # 验证数据前处理流水线步骤组成的列表 + dict( # SampleFrames 类的配置 + type='SampleFrames', # 选定采样哪些视频帧 + clip_len=1, # 每个输出视频片段的帧 + frame_interval=1, # 所采相邻帧的时序间隔 + num_clips=3, # 所采帧片段的数量 + test_mode=True), # 是否设置为测试模式采帧 + dict( # RawFrameDecode 类的配置 + type='RawFrameDecode'), # 给定帧序列,加载对应帧,解码对应帧 + dict( # Resize 类的配置 + type='Resize', # 调整图片尺寸 + scale=(-1, 256)), # 调整比例 + dict( # CenterCrop 类的配置 + type='CenterCrop', # 中心裁剪 + crop_size=224), # 裁剪部分的尺寸 + dict( # Flip 类的配置 + type='Flip', # 图片翻转 + flip_ratio=0), # 翻转几率 + dict( # FormatShape 类的配置 + type='FormatShape', # 将图片格式转变为给定的输入格式 + input_format='NCHW'), # 最终的图片组成格式 + dict( # PackActionInputs 类的配置 + type='PackActionInputs') # 将输入数据打包 + ] + test_pipeline = [ # 测试数据前处理流水线步骤组成的列表 + dict( # SampleFrames 类的配置 + type='SampleFrames', # 选定采样哪些视频帧 + clip_len=1, # 每个输出视频片段的帧 + frame_interval=1, # 所采相邻帧的时序间隔 + num_clips=25, # 所采帧片段的数量 + test_mode=True), # 是否设置为测试模式采帧 + dict( # RawFrameDecode 类的配置 + type='RawFrameDecode'), # 给定帧序列,加载对应帧,解码对应帧 + dict( # Resize 类的配置 + type='Resize', # 调整图片尺寸 + scale=(-1, 256)), # 调整比例 + dict( # TenCrop 类的配置 + type='TenCrop', # 裁剪 10 个区域 + crop_size=224), # 裁剪部分的尺寸 + dict( # Flip 类的配置 + type='Flip', # 图片翻转 + flip_ratio=0), # 执行翻转几率 + dict( # FormatShape 类的配置 + type='FormatShape', # 将图片格式转变为给定的输入格式 + input_format='NCHW'), # 最终的图片组成格式 + dict( # PackActionInputs 类的配置 + type='PackActionInputs') # 将输入数据打包 + ] + + train_dataloader = dict( # 训练过程 dataloader 的配置 + batch_size=32, # 训练过程单个 GPU 的批大小 + num_workers=8, # 训练过程单个 GPU 的 数据预取的进程数 + persistent_workers=True, # 保持`Dataset` 实例 + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type=dataset_type, + ann_file=ann_file_train, + data_prefix=dict(video=data_root), + pipeline=train_pipeline)) + val_dataloader = dict( # 验证过程 dataloader 的配置 + batch_size=1, # 验证过程单个 GPU 的批大小 + num_workers=8, # 验证过程单个 GPU 的 数据预取的进程 + persistent_workers=True, # 保持`Dataset` 实例 + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_val, + data_prefix=dict(video=data_root_val), + pipeline=val_pipeline, + test_mode=True)) + test_dataloader = dict( # 测试过程 dataloader 的配置 + batch_size=32, # 测试过程单个 GPU 的批大小 + num_workers=8, # 测试过程单个 GPU 的 数据预取的进程 + persistent_workers=True, # 保持`Dataset` 实例 + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + ann_file=ann_file_val, + data_prefix=dict(video=data_root_val), + pipeline=test_pipeline, + test_mode=True)) + + # 评测器设置 + val_evaluator = dict(type='AccMetric') # 用于计算验证指标的评测对象 + test_evaluator = dict(type='AccMetric') # 用于计算测试指标的评测对象 + + train_cfg = dict( # 训练循环的配置 + type='EpochBasedTrainLoop', # 训练循环的名称 + max_epochs=100, # 整体循环次数 + val_begin=1, # 开始验证的轮次 + val_interval=1) # 执行验证的间隔 + val_cfg = dict( # 验证循环的配置 + type='ValLoop') # 验证循环的名称 + test_cfg = dict( # 测试循环的配置 + type='TestLoop') # 测试循环的名称 + + # 学习策略设置 + param_scheduler = [ # 用于更新优化器参数的参数调度程序,支持字典或列表 + dict(type='MultiStepLR', # 当轮次数达到阈值,学习率衰减 + begin=0, # 开始更新学习率的步长 + end=100, # 停止更新学习率的步长 + by_epoch=True, # 学习率是否按轮次更新 + milestones=[40, 80], # 学习率衰减阈值 + gamma=0.1)] # 学习率衰减的乘数因子 + + # 优化器设置 + optim_wrapper = dict( # 优化器钩子的配置 + type='OptimWrapper', # 优化器封装的名称, 切换到 AmpOptimWrapper 可以实现混合精度训练 + optimizer=dict( # 优化器配置。 支持各种在pytorch上的优化器。 参考 https://pytorch.org/docs/stable/optim.html#algorithms + type='SGD', # 优化器名称 + lr=0.01, # 学习率 + momentum=0.9, # 动量大小 + weight_decay=0.0001) # SGD 优化器权重衰减 + clip_grad=dict(max_norm=40, norm_type=2)) # 梯度裁剪的配置 + 
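+    # Example (a sketch for illustration): as noted above, enabling mixed-precision
+    # training only requires switching the wrapper type, e.g.
+    # optim_wrapper = dict(
+    #     type='AmpOptimWrapper',
+    #     optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001),
+    #     clip_grad=dict(max_norm=40, norm_type=2))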
+ # 运行设置 + default_scope = 'mmaction' # 查找模块的默认注册表范围。 参考 https://mmengine.readthedocs.io/en/latest/tutorials/registry.html + default_hooks = dict( # 执行默认操作的钩子,如更新模型参数和保存checkpoints。 + runtime_info=dict(type='RuntimeInfoHook'), # 将运行信息更新到消息中心的钩子。 + timer=dict(type='IterTimerHook'), # 记录迭代期间花费时间的日志。 + logger=dict( + type='LoggerHook', # 记录训练/验证/测试阶段记录日志。 + interval=20, # 打印日志间隔 + ignore_last=False), # 忽略每个轮次中最后一次迭代的日志 + param_scheduler=dict(type='ParamSchedulerHook'), # 更新优化器中一些超参数的钩子 + checkpoint=dict( + type='CheckpointHook', # 定期保存检查点的钩子 + interval=3, # 保存周期 + save_best='auto', # 在评估期间测量最佳检查点的指标 + max_keep_ckpts=3), # 要保留的最大检查点 + sampler_seed=dict(type='DistSamplerSeedHook'), # 分布式训练的数据加载采样器 + sync_buffers=dict(type='SyncBuffersHook')) # 在每个轮次结束时同步模型缓冲区 + env_cfg = dict( # 环境设置 + cudnn_benchmark=False, # 是否启用cudnn基准 + mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0), # 设置多线程处理的参数 + dist_cfg=dict(backend='nccl')) # 设置分布式环境的参数,也可以设置端口 + + log_processor = dict( + type='LogProcessor', # 用于格式化日志信息的日志处理器 + window_size=20, # 默认平滑间隔 + by_epoch=True) # 是否以epoch类型格式化日志 + vis_backends = [ # 可视化后端列表 + dict(type='LocalVisBackend')] # 本地可视化后端 + visualizer = dict( # 可视化工具的配置 + type='ActionVisualizer', # 可视化工具的名称 + vis_backends=vis_backends) + log_level = 'INFO' # 日志记录级别 + load_from = None # 从给定路径加载模型checkpoint作为预训练模型。这不会恢复训练。 + resume = False # 是否从`load_from`中定义的checkpoint恢复。如果“load_from”为“None”,它将恢复“work_dir”中的最新的checkpoint。 + ``` + +### 时空动作检测的配置文件系统 + +MMAction2 将模块化设计整合到配置文件系统中,以便于执行各种不同的实验。 + +- 以 FastRCNN 为例 + + 为了帮助用户理解 MMAction2 的完整配置文件结构,以及时空检测系统中的一些模块,这里以 FastRCNN 为例,给出其配置文件的注释。 + 对于每个模块的详细用法以及对应参数的选择,请参照 [API 文档](https://mmaction2.readthedocs.io/en/latest/api.html)。 + + ```python + # 模型设置 + model = dict( # 模型的配置 + type='FastRCNN', # 时空检测器类型 + _scope_='mmdet', # 当前配置的范围 + backbone=dict( # Backbone 字典设置 + type='ResNet3dSlowOnly', # Backbone 名 + depth=50, # ResNet 模型深度 + pretrained=None, # 预训练模型的 url 或文件位置 + pretrained2d=False, # 预训练模型是否为 2D 模型 + lateral=False, # backbone 是否有侧连接 + num_stages=4, # ResNet 模型阶数 + conv1_kernel=(1, 7, 7), # Conv1 卷积核尺寸 + conv1_stride_t=1, # Conv1 时序步长 + pool1_stride_t=1, # Pool1 时序步长 + spatial_strides=(1, 2, 2, 1)), # 每个 ResNet 阶的空间步长 + roi_head=dict( # roi_head 字典设置 + type='AVARoIHead', # roi_head 名 + bbox_roi_extractor=dict( # bbox_roi_extractor 字典设置 + type='SingleRoIExtractor3D', # bbox_roi_extractor 名 + roi_layer_type='RoIAlign', # RoI op 类型 + output_size=8, # RoI op 输出特征尺寸 + with_temporal_pool=True), # 时序维度是否要经过池化 + bbox_head=dict( # bbox_head 字典设置 + type='BBoxHeadAVA', # bbox_head 名 + in_channels=2048, # 输入特征通道数 + num_classes=81, # 动作类别数 + 1(背景) + multilabel=True, # 数据集是否多标签 + dropout_ratio=0.5)), # dropout 比率 + data_preprocessor=dict( # 数据预处理器的字典 + type='ActionDataPreprocessor', # 数据预处理器的名称 + mean=[123.675, 116.28, 103.53], # 不同通道归一化的均值 + std=[58.395, 57.12, 57.375], # 不同通道归一化的方差 + format_shape='NCHW')), # 最终图像形状 + # 模型训练和测试的设置 + train_cfg=dict( # 训练 FastRCNN 的超参配置 + rcnn=dict( # rcnn 训练字典设置 + assigner=dict( # assigner 字典设置 + type='MaxIoUAssignerAVA', # assigner 名 + pos_iou_thr=0.9, # 正样本 IoU 阈值, > pos_iou_thr -> positive + neg_iou_thr=0.9, # 负样本 IoU 阈值, < neg_iou_thr -> negative + min_pos_iou=0.9), # 正样本最小可接受 IoU + sampler=dict( # sample 字典设置 + type='RandomSampler', # sampler 名 + num=32, # sampler 批大小 + pos_fraction=1, # sampler 正样本边界框比率 + neg_pos_ub=-1, # 负样本数转正样本数的比率上界 + add_gt_as_proposals=True), # 是否添加 ground truth 为候选 + pos_weight=1.0)), # 正样本 loss 权重 + test_cfg=dict( # 测试 FastRCNN 的超参设置 + rcnn=dict(rcnn=None)) # rcnn 测试字典设置 + + # 数据集设置 + 
dataset_type = 'AVADataset' # 训练,验证,测试的数据集类型 + data_root = 'data/ava/rawframes' # 训练集的根目录 + anno_root = 'data/ava/annotations' # 标注文件目录 + + ann_file_train = f'{anno_root}/ava_train_v2.1.csv' # 训练集的标注文件 + ann_file_val = f'{anno_root}/ava_val_v2.1.csv' # 验证集的标注文件 + + exclude_file_train = f'{anno_root}/ava_train_excluded_timestamps_v2.1.csv' # 训练除外数据集文件路径 + exclude_file_val = f'{anno_root}/ava_val_excluded_timestamps_v2.1.csv' # 验证除外数据集文件路径 + + label_file = f'{anno_root}/ava_action_list_v2.1_for_activitynet_2018.pbtxt' # 标签文件路径 + + proposal_file_train = f'{anno_root}/ava_dense_proposals_train.FAIR.recall_93.9.pkl' # 训练样本检测候选框的文件路径 + proposal_file_val = f'{anno_root}/ava_dense_proposals_val.FAIR.recall_93.9.pkl' # 验证样本检测候选框的文件路径 + + + train_pipeline = [ # 训练数据前处理流水线步骤组成的列表 + dict( # SampleFrames 类的配置 + type='AVASampleFrames', # 选定采样哪些视频帧 + clip_len=4, # 每个输出视频片段的帧 + frame_interval=16), # 所采相邻帧的时序间隔 + dict( # RawFrameDecode 类的配置 + type='RawFrameDecode'), # 给定帧序列,加载对应帧,解码对应帧 + dict( # RandomRescale 类的配置 + type='RandomRescale', # 给定一个范围,进行随机短边缩放 + scale_range=(256, 320)), # RandomRescale 的短边缩放范围 + dict( # RandomCrop 类的配置 + type='RandomCrop', # 给定一个尺寸进行随机裁剪 + size=256), # 裁剪尺寸 + dict( # Flip 类的配置 + type='Flip', # 图片翻转 + flip_ratio=0.5), # 执行翻转几率 + dict( # FormatShape 类的配置 + type='FormatShape', # 将图片格式转变为给定的输入格式 + input_format='NCTHW', # 最终的图片组成格式 + collapse=True), # 去掉 N 梯度当 N == 1 + dict(type='PackActionInputs')# 打包输入数据 + ] + + val_pipeline = [ # 验证数据前处理流水线步骤组成的列表 + dict( # SampleFrames 类的配置 + type='AVASampleFrames', # 选定采样哪些视频帧 + clip_len=4, # 每个输出视频片段的帧 + frame_interval=16), # 所采相邻帧的时序间隔 + dict( # RawFrameDecode 类的配置 + type='RawFrameDecode'), # 给定帧序列,加载对应帧,解码对应帧 + dict( # Resize 类的配置 + type='Resize', # 调整图片尺寸 + scale=(-1, 256)), # 调整比例 + dict( # FormatShape 类的配置 + type='FormatShape', # 将图片格式转变为给定的输入格式 + input_format='NCTHW', # 最终的图片组成格式 + collapse=True), # 去掉 N 梯度当 N == 1 + dict(type='PackActionInputs') # 打包输入数据 + ] + + train_dataloader = dict( # 训练过程 dataloader 的配置 + batch_size=32, # 训练过程单个 GPU 的批大小 + num_workers=8, # 训练过程单个 GPU 的 数据预取的进程 + persistent_workers=True, # 如果为“True”,则数据加载器不会在轮次结束后关闭工作进程,这可以加快训练速度 + sampler=dict( + type='DefaultSampler', # 支持分布式和非分布式的DefaultSampler + shuffle=True), 随机打乱每个轮次的训练数据 + dataset=dict( # 训练数据集的配置 + type=dataset_type, + ann_file=ann_file_train, # 标注文件的路径 + exclude_file=exclude_file_train, # 不包括的标注文件路径 + label_file=label_file, # 标签文件的路径 + data_prefix=dict(img=data_root), # 帧路径的前缀 + proposal_file=proposal_file_train, # 行人检测框的路径 + pipeline=train_pipeline)) + val_dataloader = dict( # 验证过程 dataloader 的配置 + batch_size=1, # 验证过程单个 GPU 的批大小 + num_workers=8, # 验证过程单个 GPU 的 数据预取的进程 + persistent_workers=True, # 保持`Dataset` 实例 + sampler=dict( + type='DefaultSampler', + shuffle=False), # 在验证测试期间不打乱数据 + dataset=dict( # 验证集的配置 + type=dataset_type, + ann_file=ann_file_val, # 标注文件的路径 + exclude_file=exclude_file_train, # 不包括的标注文件路径 + label_file=label_file, # 标签文件的路径 + data_prefix=dict(video=data_root_val), # 帧路径的前缀 + proposal_file=proposal_file_val, # # 行人检测框的路径 + pipeline=val_pipeline, + test_mode=True)) + test_dataloader = val_dataloader # 测试过程 dataloader 的配置 + + + # 评估器设置 + val_evaluator = dict( # 验证评估器的配置 + type='AccMetric', + ann_file=ann_file_val, + label_file=label_file, + exclude_file=exclude_file_val) + test_evaluator = val_evaluator # 测试评估器的配置 + + train_cfg = dict( # 训练循环的配置 + type='EpochBasedTrainLoop', # 训练循环的名称 + max_epochs=20, # 整体循环次数 + val_begin=1, # 开始验证的轮次 + val_interval=1) # 执行验证的间隔 + val_cfg = dict( # 验证循环的配置 + type='ValLoop') # 验证循环的名称 + test_cfg = dict( # 
测试循环的配置 + type='TestLoop') # 测试循环的名称 + + # 学习策略设置 + param_scheduler = [ # 用于更新优化器参数的参数调度程序,支持字典或列表 + dict(type='LinearLR',# 通过乘法因子线性衰减来降低各参数组的学习率 + start_factor=0.1,# 乘以第一个轮次的学习率的数值 + by_epoch=True,# 学习率是否按轮次更新 + begin=0,# 开始更新学习率的步长 + end=5),# 停止更新学习率的步长 + dict(type='MultiStepLR', # 当轮次数达到阈值,学习率衰减 + begin=0, # 开始更新学习率的步长 + end=20, # 停止更新学习率的步长 + by_epoch=True, # 学习率是否按轮次更新 + milestones=[10, 15], # 学习率衰减阈值 + gamma=0.1)] # 学习率衰减的乘数因子 + + + # 优化器设置 + optim_wrapper = dict( # 优化器钩子的配置 + type='OptimWrapper', # 优化器封装的名称, 切换到 AmpOptimWrapper 可以实现混合精度训练 + optimizer=dict( # 优化器配置。 支持各种在pytorch上的优化器。 参考 https://pytorch.org/docs/stable/optim.html#algorithms + type='SGD', # 优化器名称 + lr=0.2, # 学习率 + momentum=0.9, # 动量大小 + weight_decay=0.0001) # SGD 优化器权重衰减 + clip_grad=dict(max_norm=40, norm_type=2)) # 梯度裁剪的配置 + + # 运行设置 + default_scope = 'mmaction' # 查找模块的默认注册表范围。 参考 https://mmengine.readthedocs.io/en/latest/tutorials/registry.html + default_hooks = dict( # 执行默认操作的钩子,如更新模型参数和保存checkpoints。 + runtime_info=dict(type='RuntimeInfoHook'), # 将运行信息更新到消息中心的钩子。 + timer=dict(type='IterTimerHook'), # 记录迭代期间花费时间的日志。 + logger=dict( + type='LoggerHook', # 记录训练/验证/测试阶段记录日志。 + interval=20, # 打印日志间隔 + ignore_last=False), # 忽略每个轮次中最后一次迭代的日志 + param_scheduler=dict(type='ParamSchedulerHook'), # 更新优化器中一些超参数的钩子 + checkpoint=dict( + type='CheckpointHook', # 定期保存检查点的钩子 + interval=3, # 保存周期 + save_best='auto', # 在评估期间测量最佳检查点的指标 + max_keep_ckpts=3), # 要保留的最大检查点 + sampler_seed=dict(type='DistSamplerSeedHook'), # 分布式训练的数据加载采样器 + sync_buffers=dict(type='SyncBuffersHook')) # 在每个轮次结束时同步模型缓冲区 + env_cfg = dict( # 环境设置 + cudnn_benchmark=False, # 是否启用cudnn基准 + mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0), # 设置多线程处理的参数 + dist_cfg=dict(backend='nccl')) # 设置分布式环境的参数,也可以设置端口 + + log_processor = dict( + type='LogProcessor', # 用于格式化日志信息的日志处理器 + window_size=20, # 默认平滑间隔 + by_epoch=True) # 是否以epoch类型格式化日志 + vis_backends = [ # 可视化后端列表 + dict(type='LocalVisBackend')] # 本地可视化后端 + visualizer = dict( # 可视化工具的配置 + type='ActionVisualizer', # 可视化工具的名称 + vis_backends=vis_backends) + log_level = 'INFO' # 日志记录级别 + load_from = ('https://download.openmmlab.com/mmaction/v1.0/recognition/slowonly/' + 'slowonly_imagenet-pretrained-r50_8xb16-4x16x1-steplr-150e_kinetics400-rgb/' + 'slowonly_imagenet-pretrained-r50_8xb16-4x16x1-steplr-150e_kinetics400-rgb_20220901-e7b65fad.pth') # 从给定路径加载模型checkpoint作为预训练模型。这不会恢复训练。 + resume = False # 是否从`load_from`中定义的checkpoint恢复。如果“load_from”为“None”,它将恢复“work_dir”中的最新的checkpoint。 + ``` + +### 时序动作检测的配置文件系统 + +MMAction2 将模块化设计整合到配置文件系统中,以便于执行各种不同的实验。 + +- 以 BMN 为例 + + 为了帮助用户理解 MMAction2 的配置文件结构,以及时序动作检测系统中的一些模块,这里以 BMN 为例,给出其配置文件的注释。 + 对于每个模块的详细用法以及对应参数的选择,请参照 [API 文档](https://mmaction2.readthedocs.io/en/latest/api.html)。 + + ```python + # 模型设置 + model = dict( # 模型的配置 + type='BMN', # 时序动作检测器的类型 + temporal_dim=100, # 每个视频中所选择的帧数量 + boundary_ratio=0.5, # 视频边界的决策几率 + num_samples=32, # 每个候选的采样数 + num_samples_per_bin=3, # 每个样本的直方图采样数 + feat_dim=400, # 特征维度 + soft_nms_alpha=0.4, # soft-NMS 的 alpha 值 + soft_nms_low_threshold=0.5, # soft-NMS 的下界 + soft_nms_high_threshold=0.9, # soft-NMS 的上界 + post_process_top_k=100) # 后处理得到的最好的 K 个 proposal + + # 数据集设置 + dataset_type = 'ActivityNetDataset' # 训练,验证,测试的数据集类型 + data_root = 'data/activitynet_feature_cuhk/csv_mean_100/' # 训练集的根目录 + data_root_val = 'data/activitynet_feature_cuhk/csv_mean_100/' # 验证集和测试集的根目录 + ann_file_train = 'data/ActivityNet/anet_anno_train.json' # 训练集的标注文件 + ann_file_val = 'data/ActivityNet/anet_anno_val.json' # 验证集的标注文件 + ann_file_test = 
'data/ActivityNet/anet_anno_test.json' # 测试集的标注文件 + + train_pipeline = [ # 训练数据前处理流水线步骤组成的列表 + dict(type='LoadLocalizationFeature'), # 加载时序动作检测特征 + dict(type='GenerateLocalizationLabels'), # 生成时序动作检测标签 + dict( + type='PackLocalizationInputs', # 时序数据打包 + keys=('gt_bbox'), # 输入的键 + meta_keys=('video_name'))] # 输入的元键 + val_pipeline = [ # 验证数据前处理流水线步骤组成的列表 + dict(type='LoadLocalizationFeature'), # 加载时序动作检测特征 + dict(type='GenerateLocalizationLabels'), # 生成时序动作检测标签 + dict( + type='PackLocalizationInputs', # 时序数据打包 + keys=('gt_bbox'), # 输入的键 + meta_keys= ('video_name', 'duration_second', 'duration_frame', + 'annotations', 'feature_frame'))], # 输入的元键 + test_pipeline = [ # 测试数据前处理流水线步骤组成的列表 + dict(type='LoadLocalizationFeature'), # 加载时序动作检测特征 + dict( + type='PackLocalizationInputs', # 时序数据打包 + keys=('gt_bbox'), # 输入的键 + meta_keys= ('video_name', 'duration_second', 'duration_frame', + 'annotations', 'feature_frame'))], # 输入的元键 + train_dataloader = dict( # 训练过程 dataloader 的配置 + batch_size=8, # 训练过程单个 GPU 的批大小 + num_workers=8, # 训练过程单个 GPU 的 数据预取的进程 + persistent_workers=True, # 如果为“True”,则数据加载器不会在轮次结束后关闭工作进程,这可以加快训练速度 + sampler=dict( + type='DefaultSampler', # 支持分布式和非分布式的DefaultSampler + shuffle=True), 随机打乱每个轮次的训练数据 + dataset=dict( # 训练数据集的配置 + type=dataset_type, + ann_file=ann_file_train, # 标签文件的路径 + exclude_file=exclude_file_train, # 不包括的标签文件路径 + label_file=label_file, # 标签文件的路径 + data_prefix=dict(video=data_root), + data_prefix=dict(img=data_root), # Prefix of frame path + pipeline=train_pipeline)) + val_dataloader = dict( # 验证过程 dataloader 的配置 + batch_size=1, # 验证过程单个 GPU 的批大小 + num_workers=8, # 验证过程单个 GPU 的 数据预取的进程 + persistent_workers=True, # 保持`Dataset` 实例 + sampler=dict( + type='DefaultSampler', + shuffle=False), # 在验证测试过程中不打乱数据 + dataset=dict( # 验证数据集的配置 + type=dataset_type, + ann_file=ann_file_val, # 标注文件的路径 + data_prefix=dict(video=data_root_val), # 视频路径的前缀 + pipeline=val_pipeline, + test_mode=True)) + test_dataloader = dict( # 测试过程 dataloader 的配置 + batch_size=1, #测试过程单个 GPU 的批大小 + num_workers=8, # 测试过程单个 GPU 的 数据预取的进程 + persistent_workers=True, # 保持`Dataset` 实例 + sampler=dict( + type='DefaultSampler', + shuffle=False), # 在验证测试过程中不打乱数据 + dataset=dict( # 测试数据集的配置 + type=dataset_type, + ann_file=ann_file_val, # 标注文件的路径 + data_prefix=dict(video=data_root_val), # 视频路径的前缀 + pipeline=test_pipeline, + test_mode=True)) + + + # 评估器设置 + work_dir = './work_dirs/bmn_400x100_2x8_9e_activitynet_feature/' # 用于保存当前试验的模型检查点和日志的目录 + val_evaluator = dict( # 验证评估器的配置 + type='AccMetric', + metric_type='AR@AN', + dump_config=dict( # 时序输出的配置 + out=f'{work_dir}/results.json', # 输出文件的路径 + output_format='json')) # 输出文件的文件格式 + test_evaluator = val_evaluator # 测试评估器的配置 + + max_epochs = 9 # Total epochs to train the model + train_cfg = dict( # 训练循环的配置 + type='EpochBasedTrainLoop', # 训练循环的名称 + max_epochs=100, # 整体循环次数 + val_begin=1, # 开始验证的轮次 + val_interval=1) # 执行验证的间隔 + val_cfg = dict( # 验证循环的配置 + type='ValLoop') # 验证循环的名称 + test_cfg = dict( # 测试循环的配置 + type='TestLoop') # 测试循环的名称 + + # 学习策略设置 + param_scheduler = [ # 用于更新优化器参数的参数调度程序,支持字典或列表 + dict(type='MultiStepLR', # 当轮次数达到阈值,学习率衰减 + begin=0, # 开始更新学习率的步长 + end=max_epochs, # 停止更新学习率的步长 + by_epoch=True, # 学习率是否按轮次更新 + milestones=[7, ], # 学习率衰减阈值 + gamma=0.1)] # 学习率衰减的乘数因子 + + # 优化器设置 + optim_wrapper = dict( # 优化器钩子的配置 + type='OptimWrapper', # 优化器封装的名称, 切换到 AmpOptimWrapper 可以实现混合精度训练 + optimizer=dict( # 优化器配置。 支持各种在pytorch上的优化器。 参考 https://pytorch.org/docs/stable/optim.html#algorithms + type='Adam', # 优化器名称 + lr=0.001, # 学习率 + weight_decay=0.0001) # 权重衰减 + 
clip_grad=dict(max_norm=40, norm_type=2)) # 梯度裁剪的配置 + + # 运行设置 + default_scope = 'mmaction' # 查找模块的默认注册表范围。 参考 https://mmengine.readthedocs.io/en/latest/tutorials/registry.html + default_hooks = dict( # 执行默认操作的钩子,如更新模型参数和保存checkpoints。 + runtime_info=dict(type='RuntimeInfoHook'), # 将运行信息更新到消息中心的钩子。 + timer=dict(type='IterTimerHook'), # 记录迭代期间花费时间的日志。 + logger=dict( + type='LoggerHook', # 记录训练/验证/测试阶段记录日志。 + interval=20, # 打印日志间隔 + ignore_last=False), # 忽略每个轮次中最后一次迭代的日志 + param_scheduler=dict(type='ParamSchedulerHook'), # 更新优化器中一些超参数的钩子 + checkpoint=dict( + type='CheckpointHook', # 定期保存检查点的钩子 + interval=3, # 保存周期 + save_best='auto', # 在评估期间测量最佳检查点的指标 + max_keep_ckpts=3), # 要保留的最大检查点 + sampler_seed=dict(type='DistSamplerSeedHook'), # 分布式训练的数据加载采样器 + sync_buffers=dict(type='SyncBuffersHook')) # 在每个轮次结束时同步模型缓冲区 + env_cfg = dict( # 环境设置 + cudnn_benchmark=False, # 是否启用cudnn基准 + mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0), # 设置多线程处理的参数 + dist_cfg=dict(backend='nccl')) # 设置分布式环境的参数,也可以设置端口 + + log_processor = dict( + type='LogProcessor', # 用于格式化日志信息的日志处理器 + window_size=20, # 默认平滑间隔 + by_epoch=True) # 是否以epoch类型格式化日志 + vis_backends = [ # 可视化后端列表 + dict(type='LocalVisBackend')] # 本地可视化后端 + visualizer = dict( # 可视化工具的配置 + type='ActionVisualizer', # 可视化工具的名称 + vis_backends=vis_backends) + log_level = 'INFO' # 日志记录级别 + load_from = None # 从给定路径加载模型checkpoint作为预训练模型。这不会恢复训练。 + resume = False # 是否从`load_from`中定义的checkpoint恢复。如果“load_from”为“None”,它将恢复“work_dir”中的最新的checkpoint。 + ``` diff --git a/docs/zh_cn/user_guides/3_inference.md b/docs/zh_cn/user_guides/3_inference.md index 20c346c7b2..99433263df 100644 --- a/docs/zh_cn/user_guides/3_inference.md +++ b/docs/zh_cn/user_guides/3_inference.md @@ -24,14 +24,11 @@ MMAction2提供了高级 Python APIs,用于对给定视频进行推理: ```python from mmaction.apis import inference_recognizer, init_recognizer -from mmaction.utils import register_all_modules config_path = 'configs/recognition/tsn/tsn_imagenet-pretrained-r50_8xb32-1x1x8-100e_kinetics400-rgb.py' checkpoint_path = 'https://download.openmmlab.com/mmaction/v1.0/recognition/tsn/tsn_imagenet-pretrained-r50_8xb32-1x1x8-100e_kinetics400-rgb/tsn_imagenet-pretrained-r50_8xb32-1x1x8-100e_kinetics400-rgb_20220906-2692d16c.pth' # 可以是本地路径 img_path = 'demo/demo.mp4' # 您可以指定自己的视频路径 -# 注册所有模块,并将 MMACTION 设置为默认作用域。 -register_all_modules() # 从配置文件和检查点文件构建模型 model = init_recognizer(config_path, checkpoint_path, device="cpu") # 也可以是 'cuda:0' # 测试单个视频 diff --git a/mmaction/__init__.py b/mmaction/__init__.py index dac9a6dacd..8266218eed 100644 --- a/mmaction/__init__.py +++ b/mmaction/__init__.py @@ -9,7 +9,7 @@ mmcv_maximum_version = '2.1.0' mmcv_version = digit_version(mmcv.__version__) -mmengine_minimum_version = '0.3.0' +mmengine_minimum_version = '0.5.0' mmengine_maximum_version = '1.0.0' mmengine_version = digit_version(mmengine.__version__) diff --git a/mmaction/apis/__init__.py b/mmaction/apis/__init__.py index 110cbe9464..c4506d5af1 100644 --- a/mmaction/apis/__init__.py +++ b/mmaction/apis/__init__.py @@ -1,6 +1,7 @@ # Copyright (c) OpenMMLab. All rights reserved. 
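# ---- Editorial sketch (not part of this patch) -------------------------------
# The updated quick-start above no longer calls register_all_modules(); the
# default scope is now initialized inside init_recognizer itself. A minimal
# usage sketch of that documented API; the checkpoint path below is a
# placeholder and the video path is the repo demo clip:

from mmaction.apis import inference_recognizer, init_recognizer

config_path = 'configs/recognition/tsn/tsn_imagenet-pretrained-r50_8xb32-1x1x8-100e_kinetics400-rgb.py'
checkpoint_path = 'path/to/checkpoint.pth'  # placeholder; a download URL also works

model = init_recognizer(config_path, checkpoint_path, device='cpu')
result = inference_recognizer(model, 'demo/demo.mp4')  # returns an ActionDataSample
scores = result.pred_scores.item  # per-class recognition scores
# -------------------------------------------------------------------------------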
from .inference import (detection_inference, inference_recognizer, init_recognizer, pose_inference) +from .inferencers import * # NOQA __all__ = [ 'init_recognizer', 'inference_recognizer', 'detection_inference', diff --git a/mmaction/apis/inference.py b/mmaction/apis/inference.py index 64038e2c9a..ac014d0350 100644 --- a/mmaction/apis/inference.py +++ b/mmaction/apis/inference.py @@ -7,6 +7,7 @@ import torch import torch.nn as nn from mmengine.dataset import Compose, pseudo_collate +from mmengine.registry import init_default_scope from mmengine.runner import load_checkpoint from mmengine.utils import track_iter_progress @@ -36,7 +37,10 @@ def init_recognizer(config: Union[str, Path, mmengine.Config], raise TypeError('config must be a filename or Config object, ' f'but got {type(config)}') - config.model.backbone.pretrained = None + init_default_scope(config.get('default_scope', 'mmaction')) + + if config.model.backbone.get('pretrained', None): + config.model.backbone.pretrained = None model = MODELS.build(config.model) if checkpoint is not None: diff --git a/mmaction/apis/inferencers/__init__.py b/mmaction/apis/inferencers/__init__.py new file mode 100644 index 0000000000..9f62b667cf --- /dev/null +++ b/mmaction/apis/inferencers/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) OpenMMLab. All rights reserved. +from .actionrecog_inferencer import ActionRecogInferencer +from .mmaction2_inferencer import MMAction2Inferencer + +__all__ = ['ActionRecogInferencer', 'MMAction2Inferencer'] diff --git a/mmaction/apis/inferencers/actionrecog_inferencer.py b/mmaction/apis/inferencers/actionrecog_inferencer.py new file mode 100644 index 0000000000..9bfb3af7dd --- /dev/null +++ b/mmaction/apis/inferencers/actionrecog_inferencer.py @@ -0,0 +1,360 @@ +# Copyright (c) OpenMMLab. All rights reserved. +import os.path as osp +from typing import Dict, List, Optional, Sequence, Tuple, Union + +import mmengine +import numpy as np +from mmengine.dataset import Compose +from mmengine.fileio import list_from_file +from mmengine.infer.infer import BaseInferencer, ModelType +from mmengine.registry import init_default_scope +from mmengine.structures import InstanceData + +from mmaction.registry import INFERENCERS +from mmaction.structures import ActionDataSample +from mmaction.utils import ConfigType + +InstanceList = List[InstanceData] +InputType = Union[str, np.ndarray] +InputsType = Union[InputType, Sequence[InputType]] +PredType = Union[InstanceData, InstanceList] +ImgType = Union[np.ndarray, Sequence[np.ndarray]] +ResType = Union[Dict, List[Dict], InstanceData, List[InstanceData]] + + +@INFERENCERS.register_module(name='action-recognition') +@INFERENCERS.register_module() +class ActionRecogInferencer(BaseInferencer): + """The inferencer for action recognition. + + Args: + model (str, optional): Path to the config file or the model name + defined in metafile. For example, it could be + "slowfast_r50_8xb8-8x8x1-256e_kinetics400-rgb" or + "configs/recognition/slowfast/slowfast_r50_8xb8-8x8x1-256e_kinetics400-rgb.py". + weights (str, optional): Path to the checkpoint. If it is not specified + and model is a model name of metafile, the weights will be loaded + from metafile. Defaults to None. + device (str, optional): Device to run inference. If None, the available + device will be automatically used. Defaults to None. + label_file (str, optional): label file for dataset. + input_format (str): Input video format, Choices are 'video', + 'rawframes', 'array'. 
'video' means input data is a video file, + 'rawframes' means input data is a video frame folder, and 'array' + means input data is a np.ndarray. Defaults to 'video'. + pack_cfg (dict, optional): Config for `InferencerPackInput` to load + input. Defaults to empty dict. + scope (str, optional): The scope of the model. Defaults to "mmaction". + """ + + preprocess_kwargs: set = set() + forward_kwargs: set = set() + visualize_kwargs: set = { + 'return_vis', 'show', 'wait_time', 'vid_out_dir', 'draw_pred', 'fps', + 'out_type', 'target_resolution' + } + postprocess_kwargs: set = { + 'print_result', 'pred_out_file', 'return_datasample' + } + + def __init__(self, + model: Union[ModelType, str], + weights: Optional[str] = None, + device: Optional[str] = None, + label_file: Optional[str] = None, + input_format: str = 'video', + pack_cfg: dict = {}, + scope: Optional[str] = 'mmaction') -> None: + # A global counter tracking the number of videos processed, for + # naming of the output videos + self.num_visualized_vids = 0 + self.input_format = input_format + self.pack_cfg = pack_cfg.copy() + init_default_scope(scope) + super().__init__( + model=model, weights=weights, device=device, scope=scope) + + if label_file is not None: + self.visualizer.dataset_meta = dict( + classes=list_from_file(label_file)) + + def __call__(self, + inputs: InputsType, + return_datasamples: bool = False, + batch_size: int = 1, + return_vis: bool = False, + show: bool = False, + wait_time: int = 0, + draw_pred: bool = True, + vid_out_dir: str = '', + out_type: str = 'video', + print_result: bool = False, + pred_out_file: str = '', + target_resolution: Optional[Tuple[int]] = None, + **kwargs) -> dict: + """Call the inferencer. + + Args: + inputs (InputsType): Inputs for the inferencer. + return_datasamples (bool): Whether to return results as + :obj:`BaseDataElement`. Defaults to False. + batch_size (int): Inference batch size. Defaults to 1. + show (bool): Whether to display the visualization results in a + popup window. Defaults to False. + wait_time (float): The interval of show (s). Defaults to 0. + draw_pred (bool): Whether to draw predicted bounding boxes. + Defaults to True. + vid_out_dir (str): Output directory of visualization results. + If left as empty, no file will be saved. Defaults to ''. + out_type (str): Output type of visualization results. + Defaults to 'video'. + print_result (bool): Whether to print the inference result w/o + visualization to the console. Defaults to False. + pred_out_file: File to save the inference results w/o + visualization. If left as empty, no file will be saved. + Defaults to ''. + + **kwargs: Other keyword arguments passed to :meth:`preprocess`, + :meth:`forward`, :meth:`visualize` and :meth:`postprocess`. + Each key in kwargs should be in the corresponding set of + ``preprocess_kwargs``, ``forward_kwargs``, ``visualize_kwargs`` + and ``postprocess_kwargs``. + + Returns: + dict: Inference and visualization results. + """ + return super().__call__( + inputs, + return_datasamples, + batch_size, + return_vis=return_vis, + show=show, + wait_time=wait_time, + draw_pred=draw_pred, + vid_out_dir=vid_out_dir, + print_result=print_result, + pred_out_file=pred_out_file, + out_type=out_type, + target_resolution=target_resolution, + **kwargs) + + def _inputs_to_list(self, inputs: InputsType) -> list: + """Preprocess the inputs to a list. The main difference from mmengine + version is that we don't list a directory cause input could be a frame + folder. 
+ + Preprocess inputs to a list according to its type: + + - list or tuple: return inputs + - str: return a list containing the string. The string + could be a path to file, a url or other types of string according + to the task. + + Args: + inputs (InputsType): Inputs for the inferencer. + + Returns: + list: List of input for the :meth:`preprocess`. + """ + if not isinstance(inputs, (list, tuple)): + inputs = [inputs] + + return list(inputs) + + def _init_pipeline(self, cfg: ConfigType) -> Compose: + """Initialize the test pipeline.""" + test_pipeline = cfg.test_dataloader.dataset.pipeline + # Alter data pipelines for decode + if self.input_format == 'array': + for i in range(len(test_pipeline)): + if 'Decode' in test_pipeline[i]['type']: + test_pipeline[i] = dict(type='ArrayDecode') + test_pipeline = [ + x for x in test_pipeline if 'Init' not in x['type'] + ] + elif self.input_format == 'video': + if 'Init' not in test_pipeline[0]['type']: + test_pipeline = [dict(type='DecordInit')] + test_pipeline + else: + test_pipeline[0] = dict(type='DecordInit') + for i in range(len(test_pipeline)): + if 'Decode' in test_pipeline[i]['type']: + test_pipeline[i] = dict(type='DecordDecode') + elif self.input_format == 'rawframes': + if 'Init' in test_pipeline[0]['type']: + test_pipeline = test_pipeline[1:] + for i in range(len(test_pipeline)): + if 'Decode' in test_pipeline[i]['type']: + test_pipeline[i] = dict(type='RawFrameDecode') + # Alter data pipelines to close TTA, avoid OOM + # Use center crop instead of multiple crop + for i in range(len(test_pipeline)): + if test_pipeline[i]['type'] in ['ThreeCrop', 'TenCrop']: + test_pipeline[i]['type'] = 'CenterCrop' + # Use single clip for `Recognizer3D` + if cfg.model.type == 'Recognizer3D': + for i in range(len(test_pipeline)): + if test_pipeline[i]['type'] == 'SampleFrames': + test_pipeline[i]['num_clips'] = 1 + # Pack multiple types of input format + test_pipeline.insert( + 0, + dict( + type='InferencerPackInput', + input_format=self.input_format, + **self.pack_cfg)) + + return Compose(test_pipeline) + + def visualize( + self, + inputs: InputsType, + preds: PredType, + return_vis: bool = False, + show: bool = False, + wait_time: int = 0, + draw_pred: bool = True, + fps: int = 30, + out_type: str = 'video', + target_resolution: Optional[Tuple[int]] = None, + vid_out_dir: str = '', + ) -> Union[List[np.ndarray], None]: + """Visualize predictions. + + Args: + inputs (List[Union[str, np.ndarray]]): Inputs for the inferencer. + preds (List[Dict]): Predictions of the model. + return_vis (bool): Whether to return the visualization result. + Defaults to False. + show (bool): Whether to display the image in a popup window. + Defaults to False. + wait_time (float): The interval of show (s). Defaults to 0. + draw_pred (bool): Whether to draw prediction labels. + Defaults to True. + fps (int): Frames per second for saving video. Defaults to 4. + out_type (str): Output format type, choose from 'img', 'gif', + 'video'. Defaults to ``'img'``. + target_resolution (Tuple[int], optional): Set to + (desired_width desired_height) to have resized frames. If + either dimension is None, the frames are resized by keeping + the existing aspect ratio. Defaults to None. + vid_out_dir (str): Output directory of visualization results. + If left as empty, no file will be saved. Defaults to ''. + + Returns: + List[np.ndarray] or None: Returns visualization results only if + applicable. 
+ """ + if self.visualizer is None or (not show and vid_out_dir == '' + and not return_vis): + return None + + if getattr(self, 'visualizer') is None: + raise ValueError('Visualization needs the "visualizer" term' + 'defined in the config, but got None.') + + results = [] + + for single_input, pred in zip(inputs, preds): + if isinstance(single_input, str): + frames = single_input + video_name = osp.basename(single_input) + elif isinstance(single_input, np.ndarray): + frames = single_input.copy() + video_num = str(self.num_visualized_vids).zfill(8) + video_name = f'{video_num}.mp4' + else: + raise ValueError('Unsupported input type: ' + f'{type(single_input)}') + + out_path = osp.join(vid_out_dir, video_name) if vid_out_dir != '' \ + else None + + visualization = self.visualizer.add_datasample( + video_name, + frames, + pred, + show_frames=show, + wait_time=wait_time, + draw_gt=False, + draw_pred=draw_pred, + fps=fps, + out_type=out_type, + out_path=out_path, + target_resolution=target_resolution, + ) + results.append(visualization) + self.num_visualized_vids += 1 + + return results + + def postprocess( + self, + preds: PredType, + visualization: Optional[List[np.ndarray]] = None, + return_datasample: bool = False, + print_result: bool = False, + pred_out_file: str = '', + ) -> Union[ResType, Tuple[ResType, np.ndarray]]: + """Process the predictions and visualization results from ``forward`` + and ``visualize``. + + This method should be responsible for the following tasks: + + 1. Convert datasamples into a json-serializable dict if needed. + 2. Pack the predictions and visualization results and return them. + 3. Dump or log the predictions. + + Args: + preds (List[Dict]): Predictions of the model. + visualization (Optional[np.ndarray]): Visualized predictions. + return_datasample (bool): Whether to use Datasample to store + inference results. If False, dict will be used. + print_result (bool): Whether to print the inference result w/o + visualization to the console. Defaults to False. + pred_out_file: File to save the inference results w/o + visualization. If left as empty, no file will be saved. + Defaults to ''. + + Returns: + dict: Inference and visualization results with key ``predictions`` + and ``visualization``. + + - ``visualization`` (Any): Returned by :meth:`visualize`. + - ``predictions`` (dict or DataSample): Returned by + :meth:`forward` and processed in :meth:`postprocess`. + If ``return_datasample=False``, it usually should be a + json-serializable dict containing only basic data elements such + as strings and numbers. + """ + result_dict = {} + results = preds + if not return_datasample: + results = [] + for pred in preds: + result = self.pred2dict(pred) + results.append(result) + # Add video to the results after printing and dumping + result_dict['predictions'] = results + if print_result: + print(result_dict) + if pred_out_file != '': + mmengine.dump(result_dict, pred_out_file) + result_dict['visualization'] = visualization + return result_dict + + def pred2dict(self, data_sample: ActionDataSample) -> Dict: + """Extract elements necessary to represent a prediction into a + dictionary. It's better to contain only basic data elements such as + strings and numbers in order to guarantee it's json-serializable. + + Args: + data_sample (ActionDataSample): The data sample to be converted. + + Returns: + dict: The output dictionary. 
+ """ + result = {} + result['pred_labels'] = data_sample.pred_labels.item.tolist() + result['pred_scores'] = data_sample.pred_scores.item.tolist() + return result diff --git a/mmaction/apis/inferencers/mmaction2_inferencer.py b/mmaction/apis/inferencers/mmaction2_inferencer.py new file mode 100644 index 0000000000..0c1b4590de --- /dev/null +++ b/mmaction/apis/inferencers/mmaction2_inferencer.py @@ -0,0 +1,232 @@ +# Copyright (c) OpenMMLab. All rights reserved. +from typing import Dict, List, Optional, Sequence, Tuple, Union + +import mmengine +import numpy as np +from mmengine.infer import BaseInferencer +from mmengine.structures import InstanceData + +from mmaction.utils import ConfigType +from .actionrecog_inferencer import ActionRecogInferencer + +InstanceList = List[InstanceData] +InputType = Union[str, np.ndarray] +InputsType = Union[InputType, Sequence[InputType]] +PredType = Union[InstanceData, InstanceList] +ResType = Union[Dict, List[Dict], InstanceData, List[InstanceData]] + + +class MMAction2Inferencer(BaseInferencer): + """MMAction2 Inferencer. It's a unified inferencer interface for video + analyse task, currently including: ActionRecog. and it can be used to + perform end-to-end action recognition inference. + + Args: + rec (str, optional): Pretrained action recognition algorithm. + It's the path to the config file or the model name defined in + metafile. For example, it could be: + + - model alias, e.g. ``'slowfast'``, + - config name, e.g. ``'slowfast_r50_8xb8-8x8x1-256e_kinetics400 + -rgb'``, + - config path + + Defaults to ``None``. + rec_weights (str, optional): Path to the custom checkpoint file of + the selected rec model. If it is not specified and "rec" is a model + name of metafile, the weights will be loaded from metafile. + Defaults to None. + device (str, optional): Device to run inference. For example, + it could be 'cuda' or 'cpu'. If None, the available + device will be automatically used. Defaults to None. + label_file (str, optional): label file for dataset. + input_format (str): Input video format, Choices are 'video', + 'rawframes', 'array'. 'video' means input data is a video file, + 'rawframes' means input data is a video frame folder, and 'array' + means input data is a np.ndarray. Defaults to 'video'. + """ + + preprocess_kwargs: set = set() + forward_kwargs: set = set() + visualize_kwargs: set = { + 'return_vis', 'show', 'wait_time', 'vid_out_dir', 'draw_pred', 'fps', + 'out_type', 'target_resolution' + } + postprocess_kwargs: set = { + 'print_result', 'pred_out_file', 'return_datasample' + } + + def __init__(self, + rec: Optional[str] = None, + rec_weights: Optional[str] = None, + device: Optional[str] = None, + label_file: Optional[str] = None, + input_format: str = 'video') -> None: + + if rec is None: + raise ValueError('rec algorithm should provided.') + + self.visualizer = None + self.num_visualized_imgs = 0 + + if rec is not None: + self.actionrecog_inferencer = ActionRecogInferencer( + rec, rec_weights, device, label_file, input_format) + self.mode = 'rec' + + def _init_pipeline(self, cfg: ConfigType) -> None: + pass + + def forward(self, inputs: InputType, batch_size: int, + **forward_kwargs) -> PredType: + """Forward the inputs to the model. + + Args: + inputs (InputsType): The inputs to be forwarded. + batch_size (int): Batch size. Defaults to 1. + + Returns: + Dict: The prediction results. Possibly with keys "rec". 
+ """ + result = {} + if self.mode == 'rec': + predictions = self.actionrecog_inferencer( + inputs, + return_datasamples=True, + batch_size=batch_size, + **forward_kwargs)['predictions'] + result['rec'] = [[p] for p in predictions] + + return result + + def visualize(self, inputs: InputsType, preds: PredType, + **kwargs) -> List[np.ndarray]: + """Visualize predictions. + + Args: + inputs (List[Union[str, np.ndarray]]): Inputs for the inferencer. + preds (List[Dict]): Predictions of the model. + show (bool): Whether to display the image in a popup window. + Defaults to False. + wait_time (float): The interval of show (s). Defaults to 0. + draw_pred (bool): Whether to draw predicted bounding boxes. + Defaults to True. + fps (int): Frames per second for saving video. Defaults to 4. + out_type (str): Output format type, choose from 'img', 'gif', + 'video'. Defaults to ``'img'``. + target_resolution (Tuple[int], optional): Set to + (desired_width desired_height) to have resized frames. If + either dimension is None, the frames are resized by keeping + the existing aspect ratio. Defaults to None. + vid_out_dir (str): Output directory of visualization results. + If left as empty, no file will be saved. Defaults to ''. + """ + + if 'rec' in self.mode: + return self.actionrecog_inferencer.visualize( + inputs, preds['rec'][0], **kwargs) + + def __call__( + self, + inputs: InputsType, + batch_size: int = 1, + **kwargs, + ) -> dict: + """Call the inferencer. + + Args: + inputs (InputsType): Inputs for the inferencer. It can be a path + to image / image directory, or an array, or a list of these. + return_datasamples (bool): Whether to return results as + :obj:`BaseDataElement`. Defaults to False. + batch_size (int): Batch size. Defaults to 1. + **kwargs: Key words arguments passed to :meth:`preprocess`, + :meth:`forward`, :meth:`visualize` and :meth:`postprocess`. + Each key in kwargs should be in the corresponding set of + ``preprocess_kwargs``, ``forward_kwargs``, ``visualize_kwargs`` + and ``postprocess_kwargs``. + + Returns: + dict: Inference and visualization results. + """ + ( + preprocess_kwargs, + forward_kwargs, + visualize_kwargs, + postprocess_kwargs, + ) = self._dispatch_kwargs(**kwargs) + + ori_inputs = self._inputs_to_list(inputs) + + preds = self.forward(ori_inputs, batch_size, **forward_kwargs) + + visualization = self.visualize( + ori_inputs, preds, + **visualize_kwargs) # type: ignore # noqa: E501 + results = self.postprocess(preds, visualization, **postprocess_kwargs) + return results + + def _inputs_to_list(self, inputs: InputsType) -> list: + """Preprocess the inputs to a list. The main difference from mmengine + version is that we don't list a directory cause input could be a frame + folder. + + Preprocess inputs to a list according to its type: + + - list or tuple: return inputs + - str: return a list containing the string. The string + could be a path to file, a url or other types of string according + to the task. + + Args: + inputs (InputsType): Inputs for the inferencer. + + Returns: + list: List of input for the :meth:`preprocess`. + """ + if not isinstance(inputs, (list, tuple)): + inputs = [inputs] + + return list(inputs) + + def postprocess(self, + preds: PredType, + visualization: Optional[List[np.ndarray]] = None, + print_result: bool = False, + pred_out_file: str = '' + ) -> Union[ResType, Tuple[ResType, np.ndarray]]: + """Postprocess predictions. + + Args: + preds (Dict): Predictions of the model. + visualization (Optional[np.ndarray]): Visualized predictions. 
+ print_result (bool): Whether to print the result. + Defaults to False. + pred_out_file (str): Output file name to store predictions + without images. Supported file formats are “json”, “yaml/yml” + and “pickle/pkl”. Defaults to ''. + + Returns: + Dict or List[Dict]: Each dict contains the inference result of + each image. Possible keys are "rec_labels", "rec_scores" + """ + + result_dict = {} + pred_results = [{} for _ in range(len(next(iter(preds.values()))))] + if 'rec' in self.mode: + for i, rec_pred in enumerate(preds['rec']): + result = dict(rec_labels=[], rec_scores=[]) + for rec_pred_instance in rec_pred: + rec_dict_res = self.actionrecog_inferencer.pred2dict( + rec_pred_instance) + result['rec_labels'].append(rec_dict_res['pred_labels']) + result['rec_scores'].append(rec_dict_res['pred_scores']) + pred_results[i].update(result) + + result_dict['predictions'] = pred_results + if print_result: + print(result_dict) + if pred_out_file != '': + mmengine.dump(result_dict, pred_out_file) + result_dict['visualization'] = visualization + return result_dict diff --git a/mmaction/datasets/activitynet_dataset.py b/mmaction/datasets/activitynet_dataset.py index 4c0cd29f1c..3b492dd4f5 100644 --- a/mmaction/datasets/activitynet_dataset.py +++ b/mmaction/datasets/activitynet_dataset.py @@ -2,7 +2,7 @@ from typing import Callable, List, Optional, Union import mmengine -from mmengine.utils import check_file_exist +from mmengine.fileio import exists from mmaction.registry import DATASETS from mmaction.utils import ConfigType @@ -80,7 +80,7 @@ def __init__(self, def load_data_list(self) -> List[dict]: """Load annotation file to get video information.""" - check_file_exist(self.ann_file) + exists(self.ann_file) data_list = [] anno_database = mmengine.load(self.ann_file) for video_name in anno_database: diff --git a/mmaction/datasets/ava_dataset.py b/mmaction/datasets/ava_dataset.py index 8089d0fb75..1bc64c7b91 100644 --- a/mmaction/datasets/ava_dataset.py +++ b/mmaction/datasets/ava_dataset.py @@ -4,9 +4,8 @@ from typing import Callable, List, Optional, Union import numpy as np -from mmengine.fileio import load +from mmengine.fileio import exists, list_from_file, load from mmengine.logging import MMLogger -from mmengine.utils import check_file_exist from mmaction.evaluation import read_labelmap from mmaction.registry import DATASETS @@ -199,36 +198,36 @@ def parse_img_record(self, img_records: List[dict]) -> tuple: def load_data_list(self) -> List[dict]: """Load AVA annotations.""" - check_file_exist(self.ann_file) + exists(self.ann_file) data_list = [] records_dict_by_img = defaultdict(list) - with open(self.ann_file, 'r') as fin: - for line in fin: - line_split = line.strip().split(',') - - label = int(line_split[6]) - if self.custom_classes is not None: - if label not in self.custom_classes: - continue - label = self.custom_classes.index(label) - - video_id = line_split[0] - timestamp = int(line_split[1]) - img_key = f'{video_id},{timestamp:04d}' - - entity_box = np.array(list(map(float, line_split[2:6]))) - entity_id = int(line_split[7]) - shot_info = (0, (self.timestamp_end - self.timestamp_start) * - self._FPS) - - video_info = dict( - video_id=video_id, - timestamp=timestamp, - entity_box=entity_box, - label=label, - entity_id=entity_id, - shot_info=shot_info) - records_dict_by_img[img_key].append(video_info) + fin = list_from_file(self.ann_file) + for line in fin: + line_split = line.strip().split(',') + + label = int(line_split[6]) + if self.custom_classes is not None: + if label not in 
self.custom_classes: + continue + label = self.custom_classes.index(label) + + video_id = line_split[0] + timestamp = int(line_split[1]) + img_key = f'{video_id},{timestamp:04d}' + + entity_box = np.array(list(map(float, line_split[2:6]))) + entity_id = int(line_split[7]) + shot_info = (0, (self.timestamp_end - self.timestamp_start) * + self._FPS) + + video_info = dict( + video_id=video_id, + timestamp=timestamp, + entity_box=entity_box, + label=label, + entity_id=entity_id, + shot_info=shot_info) + records_dict_by_img[img_key].append(video_info) for img_key in records_dict_by_img: video_id, timestamp = img_key.split(',') @@ -530,36 +529,36 @@ def get_timestamp(self, video_id): def load_data_list(self) -> List[dict]: """Load AVA annotations.""" - check_file_exist(self.ann_file) + exists(self.ann_file) data_list = [] records_dict_by_img = defaultdict(list) - with open(self.ann_file, 'r') as fin: - for line in fin: - line_split = line.strip().split(',') - - label = int(line_split[6]) - if self.custom_classes is not None: - if label not in self.custom_classes: - continue - label = self.custom_classes.index(label) - - video_id = line_split[0] - timestamp = int(line_split[1]) - img_key = f'{video_id},{timestamp:04d}' - - entity_box = np.array(list(map(float, line_split[2:6]))) - entity_id = int(line_split[7]) - start, end = self.get_timestamp(video_id) - shot_info = (1, (end - start) * self._FPS + 1) - - video_info = dict( - video_id=video_id, - timestamp=timestamp, - entity_box=entity_box, - label=label, - entity_id=entity_id, - shot_info=shot_info) - records_dict_by_img[img_key].append(video_info) + fin = list_from_file(self.ann_file) + for line in fin: + line_split = line.strip().split(',') + + label = int(line_split[6]) + if self.custom_classes is not None: + if label not in self.custom_classes: + continue + label = self.custom_classes.index(label) + + video_id = line_split[0] + timestamp = int(line_split[1]) + img_key = f'{video_id},{timestamp:04d}' + + entity_box = np.array(list(map(float, line_split[2:6]))) + entity_id = int(line_split[7]) + start, end = self.get_timestamp(video_id) + shot_info = (1, (end - start) * self._FPS + 1) + + video_info = dict( + video_id=video_id, + timestamp=timestamp, + entity_box=entity_box, + label=label, + entity_id=entity_id, + shot_info=shot_info) + records_dict_by_img[img_key].append(video_info) for img_key in records_dict_by_img: video_id, timestamp = img_key.split(',') diff --git a/mmaction/datasets/pose_dataset.py b/mmaction/datasets/pose_dataset.py index 12227a3582..52c2c0b668 100644 --- a/mmaction/datasets/pose_dataset.py +++ b/mmaction/datasets/pose_dataset.py @@ -1,8 +1,7 @@ # Copyright (c) OpenMMLab. All rights reserved. 
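# ---- Editorial sketch (not part of this patch) -------------------------------
# The dataset refactors above replace raw open()/readlines() with
# mmengine.fileio helpers, so annotation files can also be read through the
# storage backends mmengine supports. A minimal sketch of the shared pattern,
# using a hypothetical AVA-style csv path:

from mmengine.fileio import exists, list_from_file

ann_file = 'data/ava/annotations/ava_val_v2.1.csv'  # hypothetical path
assert exists(ann_file)
for line in list_from_file(ann_file):  # one line per annotation record
    video_id, timestamp, x1, y1, x2, y2, label, entity_id = line.split(',')
# -------------------------------------------------------------------------------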
from typing import Callable, List, Optional, Union -from mmengine.fileio import load -from mmengine.utils import check_file_exist +from mmengine.fileio import exists, load from mmaction.registry import DATASETS from mmaction.utils import ConfigType @@ -48,7 +47,7 @@ def __init__(self, def load_data_list(self) -> List[dict]: """Load annotation file to get skeleton information.""" assert self.ann_file.endswith('.pkl') - check_file_exist(self.ann_file) + exists(self.ann_file) data_list = load(self.ann_file) if self.split is not None: diff --git a/mmaction/datasets/rawframe_dataset.py b/mmaction/datasets/rawframe_dataset.py index 86d40e44c8..8089e75917 100644 --- a/mmaction/datasets/rawframe_dataset.py +++ b/mmaction/datasets/rawframe_dataset.py @@ -2,7 +2,7 @@ import os.path as osp from typing import Callable, List, Optional, Union -from mmengine.utils import check_file_exist +from mmengine.fileio import exists, list_from_file from mmaction.registry import DATASETS from mmaction.utils import ConfigType @@ -109,38 +109,38 @@ def __init__(self, def load_data_list(self) -> List[dict]: """Load annotation file to get video information.""" - check_file_exist(self.ann_file) + exists(self.ann_file) data_list = [] - with open(self.ann_file, 'r') as fin: - for line in fin: - line_split = line.strip().split() - video_info = {} - idx = 0 - # idx for frame_dir - frame_dir = line_split[idx] - if self.data_prefix['img'] is not None: - frame_dir = osp.join(self.data_prefix['img'], frame_dir) - video_info['frame_dir'] = frame_dir + fin = list_from_file(self.ann_file) + for line in fin: + line_split = line.strip().split() + video_info = {} + idx = 0 + # idx for frame_dir + frame_dir = line_split[idx] + if self.data_prefix['img'] is not None: + frame_dir = osp.join(self.data_prefix['img'], frame_dir) + video_info['frame_dir'] = frame_dir + idx += 1 + if self.with_offset: + # idx for offset and total_frames + video_info['offset'] = int(line_split[idx]) + video_info['total_frames'] = int(line_split[idx + 1]) + idx += 2 + else: + # idx for total_frames + video_info['total_frames'] = int(line_split[idx]) idx += 1 - if self.with_offset: - # idx for offset and total_frames - video_info['offset'] = int(line_split[idx]) - video_info['total_frames'] = int(line_split[idx + 1]) - idx += 2 - else: - # idx for total_frames - video_info['total_frames'] = int(line_split[idx]) - idx += 1 - # idx for label[s] - label = [int(x) for x in line_split[idx:]] - assert label, f'missing label in line: {line}' - if self.multi_class: - assert self.num_classes is not None - video_info['label'] = label - else: - assert len(label) == 1 - video_info['label'] = label[0] - data_list.append(video_info) + # idx for label[s] + label = [int(x) for x in line_split[idx:]] + assert label, f'missing label in line: {line}' + if self.multi_class: + assert self.num_classes is not None + video_info['label'] = label + else: + assert len(label) == 1 + video_info['label'] = label[0] + data_list.append(video_info) return data_list diff --git a/mmaction/datasets/repeat_aug_dataset.py b/mmaction/datasets/repeat_aug_dataset.py index 47f517a916..7272d7991b 100644 --- a/mmaction/datasets/repeat_aug_dataset.py +++ b/mmaction/datasets/repeat_aug_dataset.py @@ -22,7 +22,8 @@ def get_type(transform: Union[dict, Callable]) -> str: @DATASETS.register_module() class RepeatAugDataset(VideoDataset): - """Video dataset for action recognition. + """Video dataset for action recognition use repeat augment. + https://arxiv.org/pdf/1901.09335.pdf. 
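    Roughly speaking, each video is decoded once and sampled ``num_repeats``
    times, so one dataloader batch actually contains
    ``batch_size * num_repeats`` clips. A hypothetical snippet (all values
    made up)::

        train_dataloader = dict(
            batch_size=8,
            dataset=dict(type='RepeatAugDataset', num_repeats=4, ...))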
The dataset loads raw videos and apply specified transforms to return a dict containing the frame tensors and other information. @@ -47,6 +48,10 @@ class RepeatAugDataset(VideoDataset): data transforms. data_prefix (dict or ConfigDict): Path to a directory where videos are held. Defaults to ``dict(video='')``. + num_repeats (int): Number of repeat time of one video in a batch. + Defaults to 4. + sample_once (bool): Determines whether use same frame index for + repeat samples. Defaults to False. multi_class (bool): Determines whether the dataset is a multi-class dataset. Defaults to False. num_classes (int, optional): Number of classes of the dataset, used in @@ -66,6 +71,7 @@ def __init__(self, pipeline: List[Union[dict, Callable]], data_prefix: ConfigType = dict(video=''), num_repeats: int = 4, + sample_once: bool = False, multi_class: bool = False, num_classes: Optional[int] = None, start_index: int = 0, @@ -91,6 +97,7 @@ def __init__(self, test_mode=False, **kwargs) self.num_repeats = num_repeats + self.sample_once = sample_once def prepare_data(self, idx) -> List[dict]: """Get data processed by ``self.pipeline``. @@ -112,11 +119,20 @@ def prepare_data(self, idx) -> List[dict]: total_frames=data_info['total_frames'], start_index=data_info['start_index']) - for repeat in range(self.num_repeats): + if not self.sample_once: + for repeat in range(self.num_repeats): + data_info_ = transforms[1](fake_data_info) # SampleFrames + frame_inds = data_info_['frame_inds'] + frame_inds_list.append(frame_inds.reshape(-1)) + frame_inds_length.append(frame_inds.size + + frame_inds_length[-1]) + else: data_info_ = transforms[1](fake_data_info) # SampleFrames frame_inds = data_info_['frame_inds'] - frame_inds_list.append(frame_inds.reshape(-1)) - frame_inds_length.append(frame_inds.size + frame_inds_length[-1]) + for repeat in range(self.num_repeats): + frame_inds_list.append(frame_inds.reshape(-1)) + frame_inds_length.append(frame_inds.size + + frame_inds_length[-1]) for key in data_info_: data_info[key] = data_info_[key] diff --git a/mmaction/datasets/transforms/loading.py b/mmaction/datasets/transforms/loading.py index 558579b87f..8305a490b8 100644 --- a/mmaction/datasets/transforms/loading.py +++ b/mmaction/datasets/transforms/loading.py @@ -4,7 +4,7 @@ import os import os.path as osp import shutil -from typing import Optional +from typing import Optional, Union import mmcv import numpy as np @@ -460,16 +460,20 @@ def _get_sample_clips(self, num_frames: int) -> np.array: Returns: seq (list): the indexes of frames of sampled from the video. 
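        For illustration (numbers chosen for readability): in test mode with
        ``num_frames=9``, ``clip_len=4`` and ``num_clips=1``, ``seg_size`` is
        2.0 and ``duration`` is 1.0, giving the sampled indices ``[1, 3, 5, 7]``.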
""" - assert self.num_clips == 1 seg_size = float(num_frames - 1) / self.clip_len inds = [] - for i in range(self.clip_len): - start = int(np.round(seg_size * i)) - end = int(np.round(seg_size * (i + 1))) - if not self.test_mode: + if not self.test_mode: + for i in range(self.clip_len): + start = int(np.round(seg_size * i)) + end = int(np.round(seg_size * (i + 1))) inds.append(np.random.randint(start, end + 1)) - else: - inds.append((start + end) // 2) + else: + duration = seg_size / (self.num_clips + 1) + for k in range(self.num_clips): + for i in range(self.clip_len): + start = int(np.round(seg_size * i)) + frame_index = start + int(duration * (k + 1)) + inds.append(frame_index) return np.array(inds) @@ -1398,6 +1402,61 @@ def __repr__(self): return repr_str +@TRANSFORMS.register_module() +class InferencerPackInput(BaseTransform): + + def __init__(self, + input_format='video', + filename_tmpl='img_{:05}.jpg', + modality='RGB', + start_index=1) -> None: + self.input_format = input_format + self.filename_tmpl = filename_tmpl + self.modality = modality + self.start_index = start_index + + def transform(self, video: Union[str, np.ndarray, dict]) -> dict: + if self.input_format == 'dict': + results = video + elif self.input_format == 'video': + results = dict( + filename=video, label=-1, start_index=0, modality='RGB') + elif self.input_format == 'rawframes': + import re + + # count the number of frames that match the format of + # `filename_tmpl` + # RGB pattern example: img_{:05}.jpg -> ^img_\d+.jpg$ + # Flow patteren example: {}_{:05d}.jpg -> ^x_\d+.jpg$ + pattern = f'^{self.filename_tmpl}$' + if self.modality == 'Flow': + pattern = pattern.replace('{}', 'x') + pattern = pattern.replace( + pattern[pattern.find('{'):pattern.find('}') + 1], '\\d+') + total_frames = len( + list( + filter(lambda x: re.match(pattern, x) is not None, + os.listdir(video)))) + results = dict( + frame_dir=video, + total_frames=total_frames, + label=-1, + start_index=self.start_index, + filename_tmpl=self.filename_tmpl, + modality=self.modality) + elif self.input_format == 'array': + modality_map = {2: 'Flow', 3: 'RGB'} + modality = modality_map.get(video.shape[-1]) + results = dict( + total_frames=video.shape[0], + label=-1, + start_index=0, + array=video, + modality=modality) + + return results + + @TRANSFORMS.register_module() class ArrayDecode(BaseTransform): """Load and decode frames with given indices from a 4D array. diff --git a/mmaction/datasets/video_dataset.py b/mmaction/datasets/video_dataset.py index a46695e40a..e085a8bcac 100644 --- a/mmaction/datasets/video_dataset.py +++ b/mmaction/datasets/video_dataset.py @@ -2,7 +2,7 @@ import os.path as osp from typing import Callable, List, Optional, Union -from mmengine.utils import check_file_exist +from mmengine.fileio import exists, list_from_file from mmaction.registry import DATASETS from mmaction.utils import ConfigType @@ -44,10 +44,12 @@ class VideoDataset(BaseActionDataset): different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Defaults to 0. - modality (str): Modality of data. Support ``RGB``, ``Flow``. - Defaults to ``RGB``. + modality (str): Modality of data. Support ``'RGB'``, ``'Flow'``. + Defaults to ``'RGB'``. test_mode (bool): Store True when building test or validation dataset. Defaults to False. + delimiter (str): Delimiter for the annotation file. + Defaults to ``' '`` (whitespace). 
""" def __init__(self, @@ -59,7 +61,9 @@ def __init__(self, start_index: int = 0, modality: str = 'RGB', test_mode: bool = False, + delimiter: str = ' ', **kwargs) -> None: + self.delimiter = delimiter super().__init__( ann_file, pipeline=pipeline, @@ -73,19 +77,19 @@ def __init__(self, def load_data_list(self) -> List[dict]: """Load annotation file to get video information.""" - check_file_exist(self.ann_file) + exists(self.ann_file) data_list = [] - with open(self.ann_file, 'r') as fin: - for line in fin: - line_split = line.strip().split() - if self.multi_class: - assert self.num_classes is not None - filename, label = line_split[0], line_split[1:] - label = list(map(int, label)) - else: - filename, label = line_split - label = int(label) - if self.data_prefix['video'] is not None: - filename = osp.join(self.data_prefix['video'], filename) - data_list.append(dict(filename=filename, label=label)) + fin = list_from_file(self.ann_file) + for line in fin: + line_split = line.strip().split(self.delimiter) + if self.multi_class: + assert self.num_classes is not None + filename, label = line_split[0], line_split[1:] + label = list(map(int, label)) + else: + filename, label = line_split + label = int(label) + if self.data_prefix['video'] is not None: + filename = osp.join(self.data_prefix['video'], filename) + data_list.append(dict(filename=filename, label=label)) return data_list diff --git a/mmaction/engine/hooks/visualization_hook.py b/mmaction/engine/hooks/visualization_hook.py index e4756ca817..b1c3ac8b47 100644 --- a/mmaction/engine/hooks/visualization_hook.py +++ b/mmaction/engine/hooks/visualization_hook.py @@ -91,7 +91,7 @@ def _draw_samples(self, draw_args = self.draw_args if self.out_dir is not None: - draw_args['out_folder'] = self.file_client.join_path( + draw_args['out_path'] = self.file_client.join_path( self.out_dir, f'{sample_name}_{step}') self._visualizer.add_datasample( diff --git a/mmaction/engine/optimizers/__init__.py b/mmaction/engine/optimizers/__init__.py index ce4f9ba0cd..b11186c82e 100644 --- a/mmaction/engine/optimizers/__init__.py +++ b/mmaction/engine/optimizers/__init__.py @@ -1,5 +1,10 @@ # Copyright (c) OpenMMLab. All rights reserved. +from .layer_decay_optim_wrapper_constructor import \ + LearningRateDecayOptimizerConstructor from .swin_optim_wrapper_constructor import SwinOptimWrapperConstructor from .tsm_optim_wrapper_constructor import TSMOptimWrapperConstructor -__all__ = ['TSMOptimWrapperConstructor', 'SwinOptimWrapperConstructor'] +__all__ = [ + 'TSMOptimWrapperConstructor', 'SwinOptimWrapperConstructor', + 'LearningRateDecayOptimizerConstructor' +] diff --git a/mmaction/engine/optimizers/layer_decay_optim_wrapper_constructor.py b/mmaction/engine/optimizers/layer_decay_optim_wrapper_constructor.py new file mode 100644 index 0000000000..966b786508 --- /dev/null +++ b/mmaction/engine/optimizers/layer_decay_optim_wrapper_constructor.py @@ -0,0 +1,123 @@ +# Copyright (c) OpenMMLab. All rights reserved. +import json +from typing import List + +import torch.nn as nn +from mmengine.dist import get_dist_info +from mmengine.logging import MMLogger +from mmengine.optim import DefaultOptimWrapperConstructor + +from mmaction.registry import OPTIM_WRAPPER_CONSTRUCTORS + + +def get_layer_id_for_mvit(var_name, max_layer_id): + """Get the layer id to set the different learning rates in ``layer_wise`` + decay_type. + + Args: + var_name (str): The key of the model. + max_layer_id (int): Maximum layer id. 
+ + Returns: + int: The id number corresponding to different learning rate in + ``LearningRateDecayOptimizerConstructor``. + """ + + if var_name in ('backbone.cls_token', 'backbone.mask_token', + 'backbone.pos_embed'): + return 0 + elif var_name.startswith('backbone.patch_embed'): + return 0 + elif var_name.startswith('backbone.blocks'): + layer_id = int(var_name.split('.')[2]) + 1 + return layer_id + else: + return max_layer_id + 1 + + +@OPTIM_WRAPPER_CONSTRUCTORS.register_module() +class LearningRateDecayOptimizerConstructor(DefaultOptimWrapperConstructor): + """ + Different learning rates are set for different layers of backbone. + Note: Currently, this optimizer constructor is built for MViT. + + Inspiration from `the implementation in PySlowFast + `_ and MMDetection + `_ + """ + + def add_params(self, params: List[dict], module: nn.Module, + **kwargs) -> None: + """Add all parameters of module to the params list. + + The parameters of the given module will be added to the list of param + groups, with specific rules defined by paramwise_cfg. + + Args: + params (list[dict]): A list of param groups, it will be modified + in place. + module (nn.Module): The module to be added. + """ + logger = MMLogger.get_current_instance() + + parameter_groups = {} + logger.info(f'self.paramwise_cfg is {self.paramwise_cfg}') + num_layers = self.paramwise_cfg.get('num_layers') + decay_rate = self.paramwise_cfg.get('decay_rate') + decay_type = self.paramwise_cfg.get('decay_type', 'layer_wise') + logger.info('Build LearningRateDecayOptimizerConstructor ' + f'{decay_type} {decay_rate} - {num_layers}') + weight_decay = self.base_wd + + for m in module.modules(): + assert not isinstance(m, nn.modules.batchnorm._NormBase + ), 'BN is not supported with layer decay' + + for name, param in module.named_parameters(): + if not param.requires_grad: + continue # frozen weights + if len(param.shape) == 1 or name.endswith('.bias'): + group_name = 'no_decay' + this_weight_decay = 0. 
+ else: + group_name = 'decay' + this_weight_decay = weight_decay + if 'layer_wise' in decay_type: + if 'MViT' in module.backbone.__class__.__name__: + layer_id = get_layer_id_for_mvit( + name, self.paramwise_cfg.get('num_layers')) + logger.info(f'set param {name} as id {layer_id}') + else: + raise NotImplementedError() + else: + raise NotImplementedError(f'Only support layer wise decay,' + f'but got {decay_type}') + + group_name = f'layer_{layer_id}_{group_name}' + + if group_name not in parameter_groups: + scale = decay_rate**(num_layers - layer_id + 1) + + parameter_groups[group_name] = { + 'weight_decay': this_weight_decay, + 'params': [], + 'param_names': [], + 'lr_scale': scale, + 'group_name': group_name, + 'lr': scale * self.base_lr, + } + + parameter_groups[group_name]['params'].append(param) + parameter_groups[group_name]['param_names'].append(name) + rank, _ = get_dist_info() + if rank == 0: + to_display = {} + for key in parameter_groups: + to_display[key] = { + 'param_names': parameter_groups[key]['param_names'], + 'lr_scale': parameter_groups[key]['lr_scale'], + 'lr': parameter_groups[key]['lr'], + 'weight_decay': parameter_groups[key]['weight_decay'], + } + logger.info(f'Param groups = {json.dumps(to_display, indent=2)}') + params.extend(parameter_groups.values()) diff --git a/mmaction/models/backbones/__init__.py b/mmaction/models/backbones/__init__.py index d634099cb6..066ba18535 100644 --- a/mmaction/models/backbones/__init__.py +++ b/mmaction/models/backbones/__init__.py @@ -19,6 +19,8 @@ from .swin import SwinTransformer3D from .tanet import TANet from .timesformer import TimeSformer +from .uniformer import UniFormer +from .uniformerv2 import UniFormerV2 from .vit_mae import VisionTransformer from .x3d import X3D @@ -27,5 +29,5 @@ 'OmniResNet', 'ResNet', 'ResNet2Plus1d', 'ResNet3d', 'ResNet3dCSN', 'ResNet3dLayer', 'ResNet3dSlowFast', 'ResNet3dSlowOnly', 'ResNetAudio', 'ResNetTIN', 'ResNetTSM', 'STGCN', 'SwinTransformer3D', 'TANet', - 'TimeSformer', 'VisionTransformer', 'X3D' + 'TimeSformer', 'UniFormer', 'UniFormerV2', 'VisionTransformer', 'X3D' ] diff --git a/mmaction/models/backbones/mvit.py b/mmaction/models/backbones/mvit.py index 95f917f136..182741f495 100644 --- a/mmaction/models/backbones/mvit.py +++ b/mmaction/models/backbones/mvit.py @@ -7,8 +7,10 @@ import torch.nn.functional as F from mmcv.cnn import build_activation_layer, build_norm_layer from mmcv.cnn.bricks import DropPath +from mmengine.logging import MMLogger from mmengine.model import BaseModule, ModuleList from mmengine.model.weight_init import trunc_normal_ +from mmengine.runner.checkpoint import _load_checkpoint_with_prefix from mmengine.utils import to_3tuple from mmaction.registry import MODELS @@ -557,6 +559,11 @@ class MViT(BaseModule): temporal_size (int): The expected input temporal_size shape. Defaults to 224. in_channels (int): The num of input channels. Defaults to 3. + pretrained (str, optional): Name of pretrained model. + Defaults to None. + pretrained_type (str, optional): Type of pretrained model. choose from + 'imagenet', 'maskfeat', None. Defaults to None, which means load + from same architecture. out_scales (int | Sequence[int]): The output scale indices. They should not exceed the length of ``downscale_indices``. Defaults to -1, which means the last scale. 
@@ -656,6 +663,7 @@ def __init__( temporal_size: int = 16, in_channels: int = 3, pretrained: Optional[str] = None, + pretrained_type: Optional[str] = None, out_scales: Union[int, Sequence[int]] = -1, drop_path_rate: float = 0., use_abs_pos_embed: bool = False, @@ -677,13 +685,14 @@ def __init__( kernel_size=(3, 7, 7), stride=(2, 4, 4), padding=(1, 3, 3)), init_cfg: Optional[Union[Dict, List[Dict]]] = [ dict(type='TruncNormal', layer=['Conv2d', 'Conv3d'], std=0.02), - dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.02), dict(type='Constant', layer='LayerNorm', val=1., bias=0.02), ] ) -> None: if pretrained: - self.init_cfg = dict(type='Pretrained', checkpoint=pretrained) - super().__init__(init_cfg=init_cfg) + init_cfg = dict(type='Pretrained', checkpoint=pretrained) + super().__init__(init_cfg=init_cfg.copy()) + self.pretrained_type = pretrained_type if isinstance(arch, str): arch = arch.lower() @@ -702,6 +711,9 @@ def __init__( self.num_layers = self.arch_settings['num_layers'] self.num_heads = self.arch_settings['num_heads'] self.downscale_indices = self.arch_settings['downscale_indices'] + # Defaults take downscale_indices as downscale_indices + self.dim_mul_indices = self.arch_settings.get( + 'dim_mul_indices', self.downscale_indices.copy()) self.num_scales = len(self.downscale_indices) + 1 self.stage_indices = { index - 1: i @@ -758,19 +770,21 @@ def __init__( stride_kv = adaptive_kv_stride input_size = self.patch_resolution for i in range(self.num_layers): - if i in self.downscale_indices: + if i in self.downscale_indices or i in self.dim_mul_indices: num_heads *= head_mul + + if i in self.downscale_indices: stride_q = [1, 2, 2] stride_kv = [max(s // 2, 1) for s in stride_kv] else: stride_q = [1, 1, 1] # Set output embed_dims - if dim_mul_in_attention and i in self.downscale_indices: - # multiply embed_dims in downscale layers. + if dim_mul_in_attention and i in self.dim_mul_indices: + # multiply embed_dims in dim_mul_indices layers. out_dims = out_dims_list[-1] * dim_mul - elif not dim_mul_in_attention and i + 1 in self.downscale_indices: - # multiply embed_dims before downscale layers. + elif not dim_mul_in_attention and i + 1 in self.dim_mul_indices: + # multiply embed_dims before dim_mul_indices layers. out_dims = out_dims_list[-1] * dim_mul else: out_dims = out_dims_list[-1] @@ -803,12 +817,44 @@ def __init__( self.add_module(f'norm{stage_index}', norm_layer) def init_weights(self, pretrained: Optional[str] = None) -> None: - super().init_weights() - - if (isinstance(self.init_cfg, dict) - and self.init_cfg['type'] == 'Pretrained'): - # Suppress default init if use pretrained model. 
- return + # interpolate maskfeat relative position embedding + if self.pretrained_type == 'maskfeat': + logger = MMLogger.get_current_instance() + pretrained = self.init_cfg['checkpoint'] + logger.info(f'load pretrained model from {pretrained}') + state_dict = _load_checkpoint_with_prefix( + 'backbone.', pretrained, map_location='cpu') + attn_rel_pos_keys = [ + k for k in state_dict.keys() if 'attn.rel_pos' in k + ] + for k in attn_rel_pos_keys: + attn_rel_pos_pretrained = state_dict[k] + attn_rel_pos_current = self.state_dict()[k] + L1, dim1 = attn_rel_pos_pretrained.size() + L2, dim2 = attn_rel_pos_current.size() + if dim1 != dim2: + logger.warning(f'Dim mismatch in loading {k}, passing') + else: + if L1 != L2: + interp_param = torch.nn.functional.interpolate( + attn_rel_pos_pretrained.t().unsqueeze(0), + size=L2, + mode='linear') + interp_param = \ + interp_param.view(dim2, L2).permute(1, 0) + state_dict[k] = interp_param + logger.info( + f'{k} reshaped from {(L1, dim1)} to {L2, dim2}') + msg = self.load_state_dict(state_dict, strict=False) + logger.info(msg) + + elif self.pretrained_type is None: + super().init_weights() + + if (isinstance(self.init_cfg, dict) + and self.init_cfg['type'] == 'Pretrained'): + # Suppress default init if use pretrained model. + return if self.use_abs_pos_embed: trunc_normal_(self.pos_embed, std=0.02) diff --git a/mmaction/models/backbones/uniformer.py b/mmaction/models/backbones/uniformer.py new file mode 100644 index 0000000000..97ac6184c1 --- /dev/null +++ b/mmaction/models/backbones/uniformer.py @@ -0,0 +1,669 @@ +# Copyright (c) OpenMMLab. All rights reserved. +import os +from typing import Dict, List, Optional, Union + +import torch +import torch.nn as nn +from mmcv.cnn.bricks import DropPath +from mmengine.logging import MMLogger +from mmengine.model import BaseModule, ModuleList +from mmengine.runner.checkpoint import _load_checkpoint +from mmengine.utils import to_2tuple + +from mmaction.registry import MODELS + +logger = MMLogger.get_current_instance() + +MODEL_PATH = 'https://download.openmmlab.com/mmaction/v1.0/recognition' +_MODELS = { + 'uniformer_small_in1k': + os.path.join(MODEL_PATH, + 'uniformerv1/uniformer_small_in1k_20221219-fe0a7ae0.pth'), + 'uniformer_base_in1k': + os.path.join(MODEL_PATH, + 'uniformerv1/uniformer_base_in1k_20221219-82c01015.pth'), +} + + +def conv_3xnxn(inp: int, + oup: int, + kernel_size: int = 3, + stride: int = 3, + groups: int = 1): + """3D convolution with kernel size of 3xnxn. + + Args: + inp (int): Dimension of input features. + oup (int): Dimension of output features. + kernel_size (int): The spatial kernel size (i.e., n). + Defaults to 3. + stride (int): The spatial stride. + Defaults to 3. + groups (int): Group number of operated features. + Defaults to 1. + """ + return nn.Conv3d( + inp, + oup, (3, kernel_size, kernel_size), (2, stride, stride), (1, 0, 0), + groups=groups) + + +def conv_1xnxn(inp: int, + oup: int, + kernel_size: int = 3, + stride: int = 3, + groups: int = 1): + """3D convolution with kernel size of 1xnxn. + + Args: + inp (int): Dimension of input features. + oup (int): Dimension of output features. + kernel_size (int): The spatial kernel size (i.e., n). + Defaults to 3. + stride (int): The spatial stride. + Defaults to 3. + groups (int): Group number of operated features. + Defaults to 1. 
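+
+    For example, ``conv_1xnxn(64, 128, kernel_size=2, stride=2)`` builds the
+    spatial-only downsampling convolution used by ``PatchEmbed`` below.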
+ """ + return nn.Conv3d( + inp, + oup, (1, kernel_size, kernel_size), (1, stride, stride), (0, 0, 0), + groups=groups) + + +def conv_1x1x1(inp: int, oup: int, groups: int = 1): + """3D convolution with kernel size of 1x1x1. + + Args: + inp (int): Dimension of input features. + oup (int): Dimension of output features. + groups (int): Group number of operated features. + Defaults to 1. + """ + return nn.Conv3d(inp, oup, (1, 1, 1), (1, 1, 1), (0, 0, 0), groups=groups) + + +def conv_3x3x3(inp: int, oup: int, groups: int = 1): + """3D convolution with kernel size of 3x3x3. + + Args: + inp (int): Dimension of input features. + oup (int): Dimension of output features. + groups (int): Group number of operated features. + Defaults to 1. + """ + return nn.Conv3d(inp, oup, (3, 3, 3), (1, 1, 1), (1, 1, 1), groups=groups) + + +def conv_5x5x5(inp: int, oup: int, groups: int = 1): + """3D convolution with kernel size of 5x5x5. + + Args: + inp (int): Dimension of input features. + oup (int): Dimension of output features. + groups (int): Group number of operated features. + Defaults to 1. + """ + return nn.Conv3d(inp, oup, (5, 5, 5), (1, 1, 1), (2, 2, 2), groups=groups) + + +def bn_3d(dim): + """3D batch normalization. + + Args: + dim (int): Dimension of input features. + """ + return nn.BatchNorm3d(dim) + + +class Mlp(BaseModule): + """Multilayer perceptron. + + Args: + in_features (int): Number of input features. + hidden_features (int): Number of hidden features. + Defaults to None. + out_features (int): Number of output features. + Defaults to None. + drop (float): Dropout rate. Defaults to 0.0. + init_cfg (dict, optional): Config dict for initialization. + Defaults to None. + """ + + def __init__( + self, + in_features: int, + hidden_features: int = None, + out_features: int = None, + drop: float = 0., + init_cfg: Optional[dict] = None, + ) -> None: + super().__init__(init_cfg=init_cfg) + + out_features = out_features or in_features + hidden_features = hidden_features or in_features + self.fc1 = nn.Linear(in_features, hidden_features) + self.act = nn.GELU() + self.fc2 = nn.Linear(hidden_features, out_features) + self.drop = nn.Dropout(drop) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + return x + + +class Attention(BaseModule): + """Self-Attention. + + Args: + dim (int): Number of input features. + num_heads (int): Number of attention heads. + Defaults to 8. + qkv_bias (bool): If True, add a learnable bias to query, key, value. + Defaults to True. + qk_scale (float, optional): Override default qk scale of + ``head_dim ** -0.5`` if set. Defaults to None. + attn_drop (float): Attention dropout rate. + Defaults to 0.0. + proj_drop (float): Dropout rate. + Defaults to 0.0. + init_cfg (dict, optional): Config dict for initialization. + Defaults to None. + init_cfg (dict, optional): The config of weight initialization. + Defaults to None. 
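+
+    Example:
+        A minimal shape-check sketch; the batch and token counts are
+        illustrative assumptions (``dim=512`` and ``num_heads=8`` match the
+        default last stage of ``UniFormer``):
+
+        >>> attn = Attention(dim=512, num_heads=8)
+        >>> x = torch.rand(2, 49, 512)  # (B, N, C)
+        >>> attn(x).shape
+        torch.Size([2, 49, 512])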
+ """ + + def __init__( + self, + dim: int, + num_heads: int = 8, + qkv_bias: bool = True, + qk_scale: float = None, + attn_drop: float = 0., + proj_drop: float = 0., + init_cfg: Optional[dict] = None, + ) -> None: + super().__init__(init_cfg=init_cfg) + + self.num_heads = num_heads + head_dim = dim // num_heads + # NOTE scale factor was wrong in my original version, + # can set manually to be compat with prev weights + self.scale = qk_scale or head_dim**-0.5 + + self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias) + self.attn_drop = nn.Dropout(attn_drop) + self.proj = nn.Linear(dim, dim) + self.proj_drop = nn.Dropout(proj_drop) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + B, N, C = x.shape + qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, + C // self.num_heads).permute(2, 0, 3, 1, 4) + q, k, v = qkv[0], qkv[1], qkv[ + 2] # make torchscript happy (cannot use tensor as tuple) + + attn = (q @ k.transpose(-2, -1)) * self.scale + attn = attn.softmax(dim=-1) + attn = self.attn_drop(attn) + + x = (attn @ v).transpose(1, 2).reshape(B, N, C) + x = self.proj(x) + x = self.proj_drop(x) + return x + + +class CMlp(BaseModule): + """Multilayer perceptron via convolution. + + Args: + in_features (int): Number of input features. + hidden_features (int): Number of hidden features. + Defaults to None. + out_features (int): Number of output features. + Defaults to None. + drop (float): Dropout rate. Defaults to 0.0. + init_cfg (dict, optional): Config dict for initialization. + Defaults to None. + """ + + def __init__( + self, + in_features, + hidden_features=None, + out_features=None, + drop=0., + init_cfg: Optional[dict] = None, + ) -> None: + super().__init__(init_cfg=init_cfg) + + out_features = out_features or in_features + hidden_features = hidden_features or in_features + self.fc1 = conv_1x1x1(in_features, hidden_features) + self.act = nn.GELU() + self.fc2 = conv_1x1x1(hidden_features, out_features) + self.drop = nn.Dropout(drop) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + return x + + +class CBlock(BaseModule): + """Convolution Block. + + Args: + dim (int): Number of input features. + mlp_ratio (float): Ratio of mlp hidden dimension + to embedding dimension. Defaults to 4. + drop (float): Dropout rate. + Defaults to 0.0. + drop_paths (float): Stochastic depth rates. + Defaults to 0.0. + init_cfg (dict, optional): Config dict for initialization. + Defaults to None. + """ + + def __init__( + self, + dim: int, + mlp_ratio: float = 4., + drop: float = 0., + drop_path: float = 0., + init_cfg: Optional[dict] = None, + ) -> None: + super().__init__(init_cfg=init_cfg) + + self.pos_embed = conv_3x3x3(dim, dim, groups=dim) + self.norm1 = bn_3d(dim) + self.conv1 = conv_1x1x1(dim, dim, 1) + self.conv2 = conv_1x1x1(dim, dim, 1) + self.attn = conv_5x5x5(dim, dim, groups=dim) + # NOTE: drop path for stochastic depth, + # we shall see if this is better than dropout here + self.drop_path = DropPath( + drop_path) if drop_path > 0. else nn.Identity() + self.norm2 = bn_3d(dim) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = CMlp( + in_features=dim, hidden_features=mlp_hidden_dim, drop=drop) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = x + self.pos_embed(x) + x = x + self.drop_path( + self.conv2(self.attn(self.conv1(self.norm1(x))))) + x = x + self.drop_path(self.mlp(self.norm2(x))) + return x + + +class SABlock(BaseModule): + """Self-Attention Block. 
+ + Args: + dim (int): Number of input features. + num_heads (int): Number of attention heads. + mlp_ratio (float): Ratio of mlp hidden dimension + to embedding dimension. Defaults to 4. + qkv_bias (bool): If True, add a learnable bias to query, key, value. + Defaults to True. + qk_scale (float, optional): Override default qk scale of + ``head_dim ** -0.5`` if set. Defaults to None. + drop (float): Dropout rate. Defaults to 0.0. + attn_drop (float): Attention dropout rate. Defaults to 0.0. + drop_paths (float): Stochastic depth rates. + Defaults to 0.0. + init_cfg (dict, optional): Config dict for initialization. + Defaults to None. + """ + + def __init__( + self, + dim: int, + num_heads: int, + mlp_ratio: float = 4., + qkv_bias: bool = False, + qk_scale: float = None, + drop: float = 0., + attn_drop: float = 0., + drop_path: float = 0., + init_cfg: Optional[dict] = None, + ) -> None: + super().__init__(init_cfg=init_cfg) + + self.pos_embed = conv_3x3x3(dim, dim, groups=dim) + self.norm1 = nn.LayerNorm(dim) + self.attn = Attention( + dim, + num_heads=num_heads, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attn_drop=attn_drop, + proj_drop=drop) + # NOTE: drop path for stochastic depth, + # we shall see if this is better than dropout here + self.drop_path = DropPath( + drop_path) if drop_path > 0. else nn.Identity() + self.norm2 = nn.LayerNorm(dim) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp( + in_features=dim, hidden_features=mlp_hidden_dim, drop=drop) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = x + self.pos_embed(x) + B, C, T, H, W = x.shape + x = x.flatten(2).transpose(1, 2) + x = x + self.drop_path(self.attn(self.norm1(x))) + x = x + self.drop_path(self.mlp(self.norm2(x))) + x = x.transpose(1, 2).reshape(B, C, T, H, W) + return x + + +class SpeicalPatchEmbed(BaseModule): + """Image to Patch Embedding. + + Add extra temporal downsampling via temporal kernel size of 3. + + Args: + img_size (int): Number of input size. + Defaults to 224. + patch_size (int): Number of patch size. + Defaults to 16. + in_chans (int): Number of input features. + Defaults to 3. + embed_dim (int): Number of output features. + Defaults to 768. + init_cfg (dict, optional): Config dict for initialization. + Defaults to None. + """ + + def __init__( + self, + img_size=224, + patch_size=16, + in_chans=3, + embed_dim=768, + init_cfg: Optional[dict] = None, + ) -> None: + super().__init__(init_cfg=init_cfg) + + img_size = to_2tuple(img_size) + patch_size = to_2tuple(patch_size) + num_patches = (img_size[1] // patch_size[1]) * ( + img_size[0] // patch_size[0]) + self.img_size = img_size + self.patch_size = patch_size + self.num_patches = num_patches + self.norm = nn.LayerNorm(embed_dim) + self.proj = conv_3xnxn( + in_chans, + embed_dim, + kernel_size=patch_size[0], + stride=patch_size[0]) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = self.proj(x) + B, _, T, H, W = x.shape + x = x.flatten(2).transpose(1, 2) + x = self.norm(x) + x = x.reshape(B, T, H, W, -1).permute(0, 4, 1, 2, 3).contiguous() + return x + + +class PatchEmbed(BaseModule): + """Image to Patch Embedding. + + Args: + img_size (int): Number of input size. + Defaults to 224. + patch_size (int): Number of patch size. + Defaults to 16. + in_chans (int): Number of input features. + Defaults to 3. + embed_dim (int): Number of output features. + Defaults to 768. + init_cfg (dict, optional): Config dict for initialization. + Defaults to None. 
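+
+    Example:
+        A shape sketch under assumed sizes (they match how ``UniFormer``
+        below builds its second stage):
+
+        >>> patch_embed = PatchEmbed(
+        ...     img_size=56, patch_size=2, in_chans=64, embed_dim=128)
+        >>> x = torch.rand(1, 64, 8, 56, 56)  # (B, C, T, H, W)
+        >>> patch_embed(x).shape
+        torch.Size([1, 128, 8, 28, 28])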
+ """ + + def __init__( + self, + img_size=224, + patch_size=16, + in_chans=3, + embed_dim=768, + init_cfg: Optional[dict] = None, + ) -> None: + super().__init__(init_cfg=init_cfg) + + img_size = to_2tuple(img_size) + patch_size = to_2tuple(patch_size) + num_patches = (img_size[1] // patch_size[1]) * ( + img_size[0] // patch_size[0]) + self.img_size = img_size + self.patch_size = patch_size + self.num_patches = num_patches + self.norm = nn.LayerNorm(embed_dim) + self.proj = conv_1xnxn( + in_chans, + embed_dim, + kernel_size=patch_size[0], + stride=patch_size[0]) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = self.proj(x) + B, _, T, H, W = x.shape + x = x.flatten(2).transpose(1, 2) + x = self.norm(x) + x = x.reshape(B, T, H, W, -1).permute(0, 4, 1, 2, 3).contiguous() + return x + + +@MODELS.register_module() +class UniFormer(BaseModule): + """UniFormer. + + A pytorch implement of: `UniFormer: Unified Transformer + for Efficient Spatiotemporal Representation Learning + ` + + Args: + depth (List[int]): List of depth in each stage. + Defaults to [5, 8, 20, 7]. + img_size (int): Number of input size. + Defaults to 224. + in_chans (int): Number of input features. + Defaults to 3. + head_dim (int): Dimension of attention head. + Defaults to 64. + embed_dim (List[int]): List of embedding dimension in each layer. + Defaults to [64, 128, 320, 512]. + mlp_ratio (float): Ratio of mlp hidden dimension + to embedding dimension. Defaults to 4. + qkv_bias (bool): If True, add a learnable bias to query, key, value. + Defaults to True. + qk_scale (float, optional): Override default qk scale of + ``head_dim ** -0.5`` if set. Defaults to None. + drop_rate (float): Dropout rate. Defaults to 0.0. + attn_drop_rate (float): Attention dropout rate. Defaults to 0.0. + drop_path_rate (float): Stochastic depth rates. + Defaults to 0.0. + clip_pretrained (bool): Whether to load pretrained CLIP visual encoder. + Defaults to True. + pretrained (str): Name of pretrained model. + Defaults to None. + init_cfg (dict or list[dict]): Initialization config dict. Defaults to + ``[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ]``. + """ + + def __init__( + self, + depth: List[int] = [5, 8, 20, 7], + img_size: int = 224, + in_chans: int = 3, + embed_dim: List[int] = [64, 128, 320, 512], + head_dim: int = 64, + mlp_ratio: float = 4., + qkv_bias: bool = True, + qk_scale: float = None, + drop_rate: float = 0., + attn_drop_rate: float = 0., + drop_path_rate: float = 0., + clip_pretrained: bool = True, + pretrained: Optional[str] = None, + init_cfg: Optional[Union[Dict, List[Dict]]] = [ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) 
+ ] + ) -> None: + super().__init__(init_cfg=init_cfg) + + self.pretrained = pretrained + self.clip_pretrained = clip_pretrained + self.patch_embed1 = SpeicalPatchEmbed( + img_size=img_size, + patch_size=4, + in_chans=in_chans, + embed_dim=embed_dim[0]) + self.patch_embed2 = PatchEmbed( + img_size=img_size // 4, + patch_size=2, + in_chans=embed_dim[0], + embed_dim=embed_dim[1]) + self.patch_embed3 = PatchEmbed( + img_size=img_size // 8, + patch_size=2, + in_chans=embed_dim[1], + embed_dim=embed_dim[2]) + self.patch_embed4 = PatchEmbed( + img_size=img_size // 16, + patch_size=2, + in_chans=embed_dim[2], + embed_dim=embed_dim[3]) + + self.pos_drop = nn.Dropout(p=drop_rate) + dpr = [ + x.item() for x in torch.linspace(0, drop_path_rate, sum(depth)) + ] # stochastic depth decay rule + num_heads = [dim // head_dim for dim in embed_dim] + self.blocks1 = ModuleList([ + CBlock( + dim=embed_dim[0], + mlp_ratio=mlp_ratio, + drop=drop_rate, + drop_path=dpr[i]) for i in range(depth[0]) + ]) + self.blocks2 = ModuleList([ + CBlock( + dim=embed_dim[1], + mlp_ratio=mlp_ratio, + drop=drop_rate, + drop_path=dpr[i + depth[0]]) for i in range(depth[1]) + ]) + self.blocks3 = ModuleList([ + SABlock( + dim=embed_dim[2], + num_heads=num_heads[2], + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + drop=drop_rate, + attn_drop=attn_drop_rate, + drop_path=dpr[i + depth[0] + depth[1]]) + for i in range(depth[2]) + ]) + self.blocks4 = ModuleList([ + SABlock( + dim=embed_dim[3], + num_heads=num_heads[3], + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + drop=drop_rate, + attn_drop=attn_drop_rate, + drop_path=dpr[i + depth[0] + depth[1] + depth[2]]) + for i in range(depth[3]) + ]) + self.norm = bn_3d(embed_dim[-1]) + + def _inflate_weight(self, + weight_2d: torch.Tensor, + time_dim: int, + center: bool = True) -> torch.Tensor: + logger.info(f'Init center: {center}') + if center: + weight_3d = torch.zeros(*weight_2d.shape) + weight_3d = weight_3d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) + middle_idx = time_dim // 2 + weight_3d[:, :, middle_idx, :, :] = weight_2d + else: + weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) + weight_3d = weight_3d / time_dim + return weight_3d + + def _load_pretrained(self, pretrained: str = None) -> None: + """Load ImageNet-1K pretrained model. + + The model is pretrained with ImageNet-1K. + https://github.com/Sense-X/UniFormer + + Args: + pretrained (str): Model name of ImageNet-1K pretrained model. + Defaults to None. 
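+
+        Note:
+            Valid names are the keys of ``_MODELS`` defined above, i.e.
+            ``'uniformer_small_in1k'`` and ``'uniformer_base_in1k'``. 2D
+            weights with mismatched shapes are inflated to 3D along the
+            temporal dimension via ``_inflate_weight``.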
+ """ + if pretrained is not None: + model_path = _MODELS[pretrained] + logger.info(f'Load ImageNet pretrained model from {model_path}') + state_dict = _load_checkpoint(model_path, map_location='cpu') + state_dict_3d = self.state_dict() + for k in state_dict.keys(): + if k in state_dict_3d.keys( + ) and state_dict[k].shape != state_dict_3d[k].shape: + if len(state_dict_3d[k].shape) <= 2: + logger.info(f'Ignore: {k}') + continue + logger.info(f'Inflate: {k}, {state_dict[k].shape}' + + f' => {state_dict_3d[k].shape}') + time_dim = state_dict_3d[k].shape[2] + state_dict[k] = self._inflate_weight( + state_dict[k], time_dim) + self.load_state_dict(state_dict, strict=False) + + def init_weights(self): + """Initialize the weights in backbone.""" + if self.clip_pretrained: + logger = MMLogger.get_current_instance() + logger.info(f'load model from: {self.pretrained}') + self._load_pretrained(self.pretrained) + else: + if self.pretrained: + self.init_cfg = dict( + type='Pretrained', checkpoint=self.pretrained) + super().init_weights() + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = self.patch_embed1(x) + x = self.pos_drop(x) + for blk in self.blocks1: + x = blk(x) + x = self.patch_embed2(x) + for blk in self.blocks2: + x = blk(x) + x = self.patch_embed3(x) + for blk in self.blocks3: + x = blk(x) + x = self.patch_embed4(x) + for blk in self.blocks4: + x = blk(x) + x = self.norm(x) + return x diff --git a/mmaction/models/backbones/uniformerv2.py b/mmaction/models/backbones/uniformerv2.py new file mode 100644 index 0000000000..64b0ba8faf --- /dev/null +++ b/mmaction/models/backbones/uniformerv2.py @@ -0,0 +1,596 @@ +# Copyright (c) OpenMMLab. All rights reserved. +import os +from collections import OrderedDict +from typing import Dict, List, Optional, Union + +import torch +from mmcv.cnn.bricks import DropPath +from mmengine.logging import MMLogger +from mmengine.model import BaseModule, ModuleList +from mmengine.runner.checkpoint import _load_checkpoint +from torch import nn + +from mmaction.registry import MODELS + +logger = MMLogger.get_current_instance() + +MODEL_PATH = 'https://download.openmmlab.com/mmaction/v1.0/recognition' +_MODELS = { + 'ViT-B/16': + os.path.join(MODEL_PATH, 'uniformerv2/clipVisualEncoder', + 'vit-base-p16-res224_clip-rgb_20221219-b8a5da86.pth'), + 'ViT-L/14': + os.path.join(MODEL_PATH, 'uniformerv2/clipVisualEncoder', + 'vit-large-p14-res224_clip-rgb_20221219-9de7543e.pth'), + 'ViT-L/14_336': + os.path.join(MODEL_PATH, 'uniformerv2/clipVisualEncoder', + 'vit-large-p14-res336_clip-rgb_20221219-d370f9e5.pth'), +} + + +class QuickGELU(BaseModule): + """Quick GELU function. Forked from https://github.com/openai/CLIP/blob/d50 + d76daa670286dd6cacf3bcd80b5e4823fc8e1/clip/model.py. + + Args: + x (torch.Tensor): The input features of shape :math:`(B, N, C)`. + """ + + def forward(self, x: torch.Tensor) -> torch.Tensor: + return x * torch.sigmoid(1.702 * x) + + +class Local_MHRA(BaseModule): + """Local MHRA. + + Args: + d_model (int): Number of input channels. + dw_reduction (float): Downsample ratio of input channels. + Defaults to 1.5. + pos_kernel_size (int): Kernel size of local MHRA. + Defaults to 3. + init_cfg (dict, optional): The config of weight initialization. + Defaults to None. 
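+
+    Example:
+        A shape-preserving sketch (the tensor sizes are illustrative
+        assumptions only):
+
+        >>> lmhra = Local_MHRA(d_model=64)
+        >>> x = torch.rand(1, 64, 8, 14, 14)  # (N, C, T, H, W)
+        >>> lmhra(x).shape
+        torch.Size([1, 64, 8, 14, 14])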
+ """ + + def __init__( + self, + d_model: int, + dw_reduction: float = 1.5, + pos_kernel_size: int = 3, + init_cfg: Optional[dict] = None, + ) -> None: + super().__init__(init_cfg=init_cfg) + + padding = pos_kernel_size // 2 + re_d_model = int(d_model // dw_reduction) + self.pos_embed = nn.Sequential( + nn.BatchNorm3d(d_model), + nn.Conv3d(d_model, re_d_model, kernel_size=1, stride=1, padding=0), + nn.Conv3d( + re_d_model, + re_d_model, + kernel_size=(pos_kernel_size, 1, 1), + stride=(1, 1, 1), + padding=(padding, 0, 0), + groups=re_d_model), + nn.Conv3d(re_d_model, d_model, kernel_size=1, stride=1, padding=0), + ) + + # init zero + logger.info('Init zero for Conv in pos_emb') + nn.init.constant_(self.pos_embed[3].weight, 0) + nn.init.constant_(self.pos_embed[3].bias, 0) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + return self.pos_embed(x) + + +class ResidualAttentionBlock(BaseModule): + """Local UniBlock. + + Args: + d_model (int): Number of input channels. + n_head (int): Number of attention head. + drop_path (float): Stochastic depth rate. + Defaults to 0.0. + dw_reduction (float): Downsample ratio of input channels. + Defaults to 1.5. + no_lmhra (bool): Whether removing local MHRA. + Defaults to False. + double_lmhra (bool): Whether using double local MHRA. + Defaults to True. + init_cfg (dict, optional): The config of weight initialization. + Defaults to None. + """ + + def __init__( + self, + d_model: int, + n_head: int, + drop_path: float = 0.0, + dw_reduction: float = 1.5, + no_lmhra: bool = False, + double_lmhra: bool = True, + init_cfg: Optional[dict] = None, + ) -> None: + super().__init__(init_cfg=init_cfg) + + self.n_head = n_head + self.drop_path = DropPath( + drop_path) if drop_path > 0. else nn.Identity() + logger.info(f'Drop path rate: {drop_path}') + + self.no_lmhra = no_lmhra + self.double_lmhra = double_lmhra + logger.info(f'No L_MHRA: {no_lmhra}') + logger.info(f'Double L_MHRA: {double_lmhra}') + if not no_lmhra: + self.lmhra1 = Local_MHRA(d_model, dw_reduction=dw_reduction) + if double_lmhra: + self.lmhra2 = Local_MHRA(d_model, dw_reduction=dw_reduction) + + # spatial + self.attn = nn.MultiheadAttention(d_model, n_head) + self.ln_1 = nn.LayerNorm(d_model) + self.mlp = nn.Sequential( + OrderedDict([('c_fc', nn.Linear(d_model, d_model * 4)), + ('gelu', QuickGELU()), + ('c_proj', nn.Linear(d_model * 4, d_model))])) + self.ln_2 = nn.LayerNorm(d_model) + + def attention(self, x: torch.Tensor) -> torch.Tensor: + return self.attn(x, x, x, need_weights=False, attn_mask=None)[0] + + def forward(self, x: torch.Tensor, T: int = 8) -> torch.Tensor: + # x: 1+HW, NT, C + if not self.no_lmhra: + # Local MHRA + tmp_x = x[1:, :, :] + L, NT, C = tmp_x.shape + N = NT // T + H = W = int(L**0.5) + tmp_x = tmp_x.view(H, W, N, T, C).permute(2, 4, 3, 0, + 1).contiguous() + tmp_x = tmp_x + self.drop_path(self.lmhra1(tmp_x)) + tmp_x = tmp_x.view(N, C, T, + L).permute(3, 0, 2, + 1).contiguous().view(L, NT, C) + x = torch.cat([x[:1, :, :], tmp_x], dim=0) + # MHSA + x = x + self.drop_path(self.attention(self.ln_1(x))) + # Local MHRA + if not self.no_lmhra and self.double_lmhra: + tmp_x = x[1:, :, :] + tmp_x = tmp_x.view(H, W, N, T, C).permute(2, 4, 3, 0, + 1).contiguous() + tmp_x = tmp_x + self.drop_path(self.lmhra2(tmp_x)) + tmp_x = tmp_x.view(N, C, T, + L).permute(3, 0, 2, + 1).contiguous().view(L, NT, C) + x = torch.cat([x[:1, :, :], tmp_x], dim=0) + # FFN + x = x + self.drop_path(self.mlp(self.ln_2(x))) + return x + + +class Extractor(BaseModule): + """Global UniBlock. 
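+
+    A cross-attention block: the query comes from ``x`` (the learnable class
+    token in ``Transformer``), while the key and value come from the local
+    UniBlock features ``y``.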
+
+    Args:
+        d_model (int): Number of input channels.
+        n_head (int): Number of attention head.
+        mlp_factor (float): Ratio of hidden dimensions in MLP layers.
+            Defaults to 4.0.
+        dropout (float): Stochastic dropout rate.
+            Defaults to 0.0.
+        drop_path (float): Stochastic depth rate.
+            Defaults to 0.0.
+        init_cfg (dict, optional): The config of weight initialization.
+            Defaults to None.
+    """
+
+    def __init__(
+        self,
+        d_model: int,
+        n_head: int,
+        mlp_factor: float = 4.0,
+        dropout: float = 0.0,
+        drop_path: float = 0.0,
+        init_cfg: Optional[dict] = None,
+    ) -> None:
+        super().__init__(init_cfg=init_cfg)
+
+        self.drop_path = DropPath(
+            drop_path) if drop_path > 0. else nn.Identity()
+        logger.info(f'Drop path rate: {drop_path}')
+        self.attn = nn.MultiheadAttention(d_model, n_head)
+        self.ln_1 = nn.LayerNorm(d_model)
+        d_mlp = round(mlp_factor * d_model)
+        self.mlp = nn.Sequential(
+            OrderedDict([('c_fc', nn.Linear(d_model, d_mlp)),
+                         ('gelu', QuickGELU()),
+                         ('dropout', nn.Dropout(dropout)),
+                         ('c_proj', nn.Linear(d_mlp, d_model))]))
+        self.ln_2 = nn.LayerNorm(d_model)
+        self.ln_3 = nn.LayerNorm(d_model)
+
+        # zero init
+        nn.init.xavier_uniform_(self.attn.in_proj_weight)
+        nn.init.constant_(self.attn.out_proj.weight, 0.)
+        nn.init.constant_(self.attn.out_proj.bias, 0.)
+        nn.init.xavier_uniform_(self.mlp[0].weight)
+        nn.init.constant_(self.mlp[-1].weight, 0.)
+        nn.init.constant_(self.mlp[-1].bias, 0.)
+
+    def attention(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
+        d_model = self.ln_1.weight.size(0)
+        q = (x @ self.attn.in_proj_weight[:d_model].T
+             ) + self.attn.in_proj_bias[:d_model]
+
+        k = (y @ self.attn.in_proj_weight[d_model:-d_model].T
+             ) + self.attn.in_proj_bias[d_model:-d_model]
+        v = (y @ self.attn.in_proj_weight[-d_model:].T
+             ) + self.attn.in_proj_bias[-d_model:]
+        Tx, Ty, N = q.size(0), k.size(0), q.size(1)
+        q = q.view(Tx, N, self.attn.num_heads,
+                   self.attn.head_dim).permute(1, 2, 0, 3)
+        k = k.view(Ty, N, self.attn.num_heads,
+                   self.attn.head_dim).permute(1, 2, 0, 3)
+        v = v.view(Ty, N, self.attn.num_heads,
+                   self.attn.head_dim).permute(1, 2, 0, 3)
+        aff = (q @ k.transpose(-2, -1) / (self.attn.head_dim**0.5))
+
+        aff = aff.softmax(dim=-1)
+        out = aff @ v
+        out = out.permute(2, 0, 1, 3).flatten(2)
+        out = self.attn.out_proj(out)
+        return out
+
+    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
+        x = x + self.drop_path(self.attention(self.ln_1(x), self.ln_3(y)))
+        x = x + self.drop_path(self.mlp(self.ln_2(x)))
+        return x
+
+
+class Transformer(BaseModule):
+    """Backbone:
+
+    Args:
+        width (int): Number of input channels in local UniBlock.
+        layers (int): Number of layers of local UniBlock.
+        heads (int): Number of attention head in local UniBlock.
+        backbone_drop_path_rate (float): Stochastic depth rate
+            in local UniBlock. Defaults to 0.0.
+        t_size (int): Number of temporal dimension after patch embedding.
+            Defaults to 8.
+        dw_reduction (float): Downsample ratio of input channels in local MHRA.
+            Defaults to 1.5.
+        no_lmhra (bool): Whether removing local MHRA in local UniBlock.
+            Defaults to True.
+        double_lmhra (bool): Whether using double local MHRA
+            in local UniBlock. Defaults to False.
+        return_list (List[int]): Layer index of input features
+            for global UniBlock. Defaults to [8, 9, 10, 11].
+        n_layers (int): Number of layers of global UniBlock.
+            Defaults to 4.
+        n_dim (int): Number of input channels in global UniBlock.
+            Defaults to 768.
+ n_head (int): Number of attention head in global UniBlock. + Defaults to 12. + mlp_factor (float): Ratio of hidden dimensions in MLP layers + in global UniBlock. Defaults to 4.0. + drop_path_rate (float): Stochastic depth rate in global UniBlock. + Defaults to 0.0. + mlp_dropout (List[float]): Stochastic dropout rate in each MLP layer + in global UniBlock. Defaults to [0.5, 0.5, 0.5, 0.5]. + init_cfg (dict, optional): The config of weight initialization. + Defaults to None. + """ + + def __init__( + self, + width: int, + layers: int, + heads: int, + backbone_drop_path_rate: float = 0., + t_size: int = 8, + dw_reduction: float = 1.5, + no_lmhra: bool = True, + double_lmhra: bool = False, + return_list: List[int] = [8, 9, 10, 11], + n_layers: int = 4, + n_dim: int = 768, + n_head: int = 12, + mlp_factor: float = 4.0, + drop_path_rate: float = 0., + mlp_dropout: List[float] = [0.5, 0.5, 0.5, 0.5], + init_cfg: Optional[dict] = None, + ) -> None: + super().__init__(init_cfg=init_cfg) + + self.T = t_size + self.return_list = return_list + # backbone + b_dpr = [ + x.item() + for x in torch.linspace(0, backbone_drop_path_rate, layers) + ] + self.resblocks = ModuleList([ + ResidualAttentionBlock( + width, + heads, + drop_path=b_dpr[i], + dw_reduction=dw_reduction, + no_lmhra=no_lmhra, + double_lmhra=double_lmhra, + ) for i in range(layers) + ]) + + # global block + assert n_layers == len(return_list) + self.temporal_cls_token = nn.Parameter(torch.zeros(1, 1, n_dim)) + self.dpe = ModuleList([ + nn.Conv3d( + n_dim, + n_dim, + kernel_size=3, + stride=1, + padding=1, + bias=True, + groups=n_dim) for _ in range(n_layers) + ]) + for m in self.dpe: + nn.init.constant_(m.bias, 0.) + dpr = [x.item() for x in torch.linspace(0, drop_path_rate, n_layers)] + self.dec = ModuleList([ + Extractor( + n_dim, + n_head, + mlp_factor=mlp_factor, + dropout=mlp_dropout[i], + drop_path=dpr[i], + ) for i in range(n_layers) + ]) + # weight sum + self.norm = nn.LayerNorm(n_dim) + self.balance = nn.Parameter(torch.zeros((n_dim))) + self.sigmoid = nn.Sigmoid() + + def forward(self, x: torch.Tensor) -> torch.Tensor: + T_down = self.T + L, NT, C = x.shape + N = NT // T_down + H = W = int((L - 1)**0.5) + cls_token = self.temporal_cls_token.repeat(1, N, 1) + + j = -1 + for i, resblock in enumerate(self.resblocks): + x = resblock(x, T_down) + if i in self.return_list: + j += 1 + tmp_x = x.clone() + tmp_x = tmp_x.view(L, N, T_down, C) + # dpe + _, tmp_feats = tmp_x[:1], tmp_x[1:] + tmp_feats = tmp_feats.permute(1, 3, 2, + 0).reshape(N, C, T_down, H, W) + tmp_feats = self.dpe[j](tmp_feats.clone()).view( + N, C, T_down, L - 1).permute(3, 0, 2, 1).contiguous() + tmp_x[1:] = tmp_x[1:] + tmp_feats + # global block + tmp_x = tmp_x.permute(2, 0, 1, 3).flatten(0, 1) # T * L, N, C + cls_token = self.dec[j](cls_token, tmp_x) + + weight = self.sigmoid(self.balance) + residual = x.view(L, N, T_down, C)[0].mean(1) # L, N, T, C + out = self.norm((1 - weight) * cls_token[0, :, :] + weight * residual) + return out + + +@MODELS.register_module() +class UniFormerV2(BaseModule): + """UniFormerV2: + + A pytorch implement of: `UniFormerV2: Spatiotemporal + Learning by Arming Image ViTs with Video UniFormer + ` + + Args: + input_resolution (int): Number of input resolution. + Defaults to 224. + patch_size (int): Number of patch size. + Defaults to 16. + width (int): Number of input channels in local UniBlock. + Defaults to 768. + layers (int): Number of layers of local UniBlock. + Defaults to 12. 
+        heads (int): Number of attention head in local UniBlock.
+            Defaults to 12.
+        backbone_drop_path_rate (float): Stochastic depth rate
+            in local UniBlock. Defaults to 0.0.
+        t_size (int): Number of temporal dimension after patch embedding.
+            Defaults to 8.
+        kernel_size (int): Temporal kernel size of the patch embedding,
+            only used when ``temporal_downsample`` is True. Defaults to 3.
+        temporal_downsample (bool): Whether to downsample the temporal
+            dimension. Defaults to False.
+        dw_reduction (float): Downsample ratio of input channels in local MHRA.
+            Defaults to 1.5.
+        no_lmhra (bool): Whether removing local MHRA in local UniBlock.
+            Defaults to True.
+        double_lmhra (bool): Whether using double local MHRA in local UniBlock.
+            Defaults to False.
+        return_list (List[int]): Layer index of input features
+            for global UniBlock. Defaults to [8, 9, 10, 11].
+        n_layers (int): Number of layers of global UniBlock.
+            Defaults to 4.
+        n_dim (int): Number of input channels in global UniBlock.
+            Defaults to 768.
+        n_head (int): Number of attention head in global UniBlock.
+            Defaults to 12.
+        mlp_factor (float): Ratio of hidden dimensions in MLP layers
+            in global UniBlock. Defaults to 4.0.
+        drop_path_rate (float): Stochastic depth rate in global UniBlock.
+            Defaults to 0.0.
+        mlp_dropout (List[float]): Stochastic dropout rate in each MLP layer
+            in global UniBlock. Defaults to [0.5, 0.5, 0.5, 0.5].
+        clip_pretrained (bool): Whether to load pretrained CLIP visual encoder.
+            Defaults to True.
+        pretrained (str): Name of pretrained model.
+            Defaults to None.
+        init_cfg (dict or list[dict]): Initialization config dict. Defaults to
+            ``[
+            dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+            dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+            ]``.
+    """
+
+    def __init__(
+        self,
+        # backbone
+        input_resolution: int = 224,
+        patch_size: int = 16,
+        width: int = 768,
+        layers: int = 12,
+        heads: int = 12,
+        backbone_drop_path_rate: float = 0.,
+        t_size: int = 8,
+        kernel_size: int = 3,
+        dw_reduction: float = 1.5,
+        temporal_downsample: bool = False,
+        no_lmhra: bool = True,
+        double_lmhra: bool = False,
+        # global block
+        return_list: List[int] = [8, 9, 10, 11],
+        n_layers: int = 4,
+        n_dim: int = 768,
+        n_head: int = 12,
+        mlp_factor: float = 4.0,
+        drop_path_rate: float = 0.,
+        mlp_dropout: List[float] = [0.5, 0.5, 0.5, 0.5],
+        # pretrain
+        clip_pretrained: bool = True,
+        pretrained: Optional[str] = None,
+        init_cfg: Optional[Union[Dict, List[Dict]]] = [
+            dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+            dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ] + ) -> None: + super().__init__(init_cfg=init_cfg) + + self.pretrained = pretrained + self.clip_pretrained = clip_pretrained + self.input_resolution = input_resolution + padding = (kernel_size - 1) // 2 + if temporal_downsample: + self.conv1 = nn.Conv3d( + 3, + width, (kernel_size, patch_size, patch_size), + (2, patch_size, patch_size), (padding, 0, 0), + bias=False) + t_size = t_size // 2 + else: + self.conv1 = nn.Conv3d( + 3, + width, (1, patch_size, patch_size), + (1, patch_size, patch_size), (0, 0, 0), + bias=False) + + scale = width**-0.5 + self.class_embedding = nn.Parameter(scale * torch.randn(width)) + self.positional_embedding = nn.Parameter(scale * torch.randn( + (input_resolution // patch_size)**2 + 1, width)) + self.ln_pre = nn.LayerNorm(width) + + self.transformer = Transformer( + width, + layers, + heads, + dw_reduction=dw_reduction, + backbone_drop_path_rate=backbone_drop_path_rate, + t_size=t_size, + no_lmhra=no_lmhra, + double_lmhra=double_lmhra, + return_list=return_list, + n_layers=n_layers, + n_dim=n_dim, + n_head=n_head, + mlp_factor=mlp_factor, + drop_path_rate=drop_path_rate, + mlp_dropout=mlp_dropout, + ) + + def _inflate_weight(self, + weight_2d: torch.Tensor, + time_dim: int, + center: bool = True) -> torch.Tensor: + logger.info(f'Init center: {center}') + if center: + weight_3d = torch.zeros(*weight_2d.shape) + weight_3d = weight_3d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) + middle_idx = time_dim // 2 + weight_3d[:, :, middle_idx, :, :] = weight_2d + else: + weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) + weight_3d = weight_3d / time_dim + return weight_3d + + def _load_pretrained(self, pretrained: str = None) -> None: + """Load CLIP pretrained visual encoder. + + The visual encoder is extracted from CLIP. + https://github.com/openai/CLIP + + Args: + pretrained (str): Model name of pretrained CLIP visual encoder. + Defaults to None. 
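+
+        Note:
+            Valid names are the keys of ``_MODELS`` defined above, i.e.
+            ``'ViT-B/16'``, ``'ViT-L/14'`` and ``'ViT-L/14_336'``. 2D
+            weights with mismatched shapes are inflated to 3D along the
+            temporal dimension via ``_inflate_weight``.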
+ """ + if pretrained is not None: + model_path = _MODELS[pretrained] + logger.info(f'Load CLIP pretrained model from {model_path}') + state_dict = _load_checkpoint(model_path, map_location='cpu') + state_dict_3d = self.state_dict() + for k in state_dict.keys(): + if k in state_dict_3d.keys( + ) and state_dict[k].shape != state_dict_3d[k].shape: + if len(state_dict_3d[k].shape) <= 2: + logger.info(f'Ignore: {k}') + continue + logger.info(f'Inflate: {k}, {state_dict[k].shape}' + + f' => {state_dict_3d[k].shape}') + time_dim = state_dict_3d[k].shape[2] + state_dict[k] = self._inflate_weight( + state_dict[k], time_dim) + self.load_state_dict(state_dict, strict=False) + + def init_weights(self): + """Initialize the weights in backbone.""" + if self.clip_pretrained: + logger = MMLogger.get_current_instance() + logger.info(f'load model from: {self.pretrained}') + self._load_pretrained(self.pretrained) + else: + if self.pretrained: + self.init_cfg = dict( + type='Pretrained', checkpoint=self.pretrained) + super().init_weights() + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = self.conv1(x) # shape = [*, width, grid, grid] + N, C, T, H, W = x.shape + x = x.permute(0, 2, 3, 4, 1).reshape(N * T, H * W, C) + + x = torch.cat([ + self.class_embedding.to(x.dtype) + torch.zeros( + x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x + ], + dim=1) # shape = [*, grid ** 2 + 1, width] + x = x + self.positional_embedding.to(x.dtype) + x = self.ln_pre(x) + + x = x.permute(1, 0, 2) # NLD -> LND + out = self.transformer(x) + return out diff --git a/mmaction/models/heads/mvit_head.py b/mmaction/models/heads/mvit_head.py index 3797bb616d..dfdbe5f781 100644 --- a/mmaction/models/heads/mvit_head.py +++ b/mmaction/models/heads/mvit_head.py @@ -23,6 +23,7 @@ class MViTHead(BaseHead): Defaults to `dict(type='CrossEntropyLoss')`. dropout_ratio (float): Probability of dropout layer. Default: 0.5. init_std (float): Std value for Initiation. Defaults to 0.02. + init_scale (float): Scale factor for Initiation parameters. Default: 1. kwargs (dict, optional): Any keyword argument to be used to initialize the head. """ @@ -33,9 +34,11 @@ def __init__(self, loss_cls: ConfigType = dict(type='CrossEntropyLoss'), dropout_ratio: float = 0.5, init_std: float = 0.02, + init_scale: float = 1.0, **kwargs) -> None: super().__init__(num_classes, in_channels, loss_cls, **kwargs) self.init_std = init_std + self.init_scale = init_scale self.dropout_ratio = dropout_ratio if self.dropout_ratio != 0: self.dropout = nn.Dropout(p=self.dropout_ratio) @@ -47,6 +50,8 @@ def init_weights(self) -> None: """Initiate the parameters from scratch.""" trunc_normal_init(self.fc_cls.weight, std=self.init_std) constant_init(self.fc_cls.bias, 0.02) + self.fc_cls.weight.data.mul_(self.init_scale) + self.fc_cls.bias.data.mul_(self.init_scale) def pre_logits(self, feats: Tuple[List[Tensor]]) -> Tensor: """The process before the final classification head. diff --git a/mmaction/models/heads/timesformer_head.py b/mmaction/models/heads/timesformer_head.py index f70dd24300..f72fdc90c5 100644 --- a/mmaction/models/heads/timesformer_head.py +++ b/mmaction/models/heads/timesformer_head.py @@ -17,6 +17,8 @@ class TimeSformerHead(BaseHead): loss_cls (dict or ConfigDict): Config for building loss. Defaults to `dict(type='CrossEntropyLoss')`. init_std (float): Std value for Initiation. Defaults to 0.02. + dropout_ratio (float): Probability of dropout layer. + Defaults to : 0.0. 
kwargs (dict, optional): Any keyword argument to be used to initialize the head. """ @@ -26,9 +28,16 @@ def __init__(self, in_channels: int, loss_cls: ConfigType = dict(type='CrossEntropyLoss'), init_std: float = 0.02, + dropout_ratio: float = 0.0, **kwargs) -> None: super().__init__(num_classes, in_channels, loss_cls, **kwargs) self.init_std = init_std + self.dropout_ratio = dropout_ratio + + if self.dropout_ratio != 0: + self.dropout = nn.Dropout(p=self.dropout_ratio) + else: + self.dropout = None self.fc_cls = nn.Linear(self.in_channels, self.num_classes) def init_weights(self) -> None: @@ -45,6 +54,9 @@ def forward(self, x: Tensor, **kwargs) -> Tensor: Tensor: The classification scores for input samples. """ # [N, in_channels] + if self.dropout is not None: + x = self.dropout(x) + # [N, in_channels] cls_score = self.fc_cls(x) # [N, num_classes] return cls_score diff --git a/mmaction/registry.py b/mmaction/registry.py index db56340ed9..28d237daa8 100644 --- a/mmaction/registry.py +++ b/mmaction/registry.py @@ -10,6 +10,7 @@ from mmengine.registry import DATASETS as MMENGINE_DATASETS from mmengine.registry import EVALUATOR as MMENGINE_EVALUATOR from mmengine.registry import HOOKS as MMENGINE_HOOKS +from mmengine.registry import INFERENCERS as MMENGINE_INFERENCERS from mmengine.registry import LOG_PROCESSORS as MMENGINE_LOG_PROCESSORS from mmengine.registry import LOOPS as MMENGINE_LOOPS from mmengine.registry import METRICS as MMENGINE_METRICS @@ -32,52 +33,97 @@ from mmengine.registry import Registry # manage all kinds of runners like `EpochBasedRunner` and `IterBasedRunner` -RUNNERS = Registry('runner', parent=MMENGINE_RUNNERS) +RUNNERS = Registry( + 'runner', parent=MMENGINE_RUNNERS, locations=['mmaction.engine.runner']) # manage runner constructors that define how to initialize runners RUNNER_CONSTRUCTORS = Registry( - 'runner constructor', parent=MMENGINE_RUNNER_CONSTRUCTORS) + 'runner constructor', + parent=MMENGINE_RUNNER_CONSTRUCTORS, + locations=['mmaction.engine.runner']) # manage all kinds of loops like `EpochBasedTrainLoop` -LOOPS = Registry('loop', parent=MMENGINE_LOOPS) +LOOPS = Registry( + 'loop', parent=MMENGINE_LOOPS, locations=['mmaction.engine.runner']) # manage all kinds of hooks like `CheckpointHook` -HOOKS = Registry('hook', parent=MMENGINE_HOOKS) +HOOKS = Registry( + 'hook', parent=MMENGINE_HOOKS, locations=['mmaction.engine.hooks']) # manage data-related modules -DATASETS = Registry('dataset', parent=MMENGINE_DATASETS) -DATA_SAMPLERS = Registry('data sampler', parent=MMENGINE_DATA_SAMPLERS) -TRANSFORMS = Registry('transform', parent=MMENGINE_TRANSFORMS) +DATASETS = Registry( + 'dataset', parent=MMENGINE_DATASETS, locations=['mmaction.datasets']) +DATA_SAMPLERS = Registry( + 'data sampler', + parent=MMENGINE_DATA_SAMPLERS, + locations=['mmaction.engine']) +TRANSFORMS = Registry( + 'transform', + parent=MMENGINE_TRANSFORMS, + locations=['mmaction.datasets.transforms']) # manage all kinds of modules inheriting `nn.Module` -MODELS = Registry('model', parent=MMENGINE_MODELS) +MODELS = Registry( + 'model', parent=MMENGINE_MODELS, locations=['mmaction.models']) # manage all kinds of model wrappers like 'MMDistributedDataParallel' -MODEL_WRAPPERS = Registry('model_wrapper', parent=MMENGINE_MODEL_WRAPPERS) +MODEL_WRAPPERS = Registry( + 'model_wrapper', + parent=MMENGINE_MODEL_WRAPPERS, + locations=['mmaction.models']) # manage all kinds of weight initialization modules like `Uniform` WEIGHT_INITIALIZERS = Registry( - 'weight initializer', 
parent=MMENGINE_WEIGHT_INITIALIZERS) + 'weight initializer', + parent=MMENGINE_WEIGHT_INITIALIZERS, + locations=['mmaction.models']) # manage all kinds of optimizers like `SGD` and `Adam` -OPTIMIZERS = Registry('optimizer', parent=MMENGINE_OPTIMIZERS) +OPTIMIZERS = Registry( + 'optimizer', + parent=MMENGINE_OPTIMIZERS, + locations=['mmaction.engine.optimizers']) # manage optimizer wrapper -OPTIM_WRAPPERS = Registry('optim_wrapper', parent=MMENGINE_OPTIM_WRAPPERS) +OPTIM_WRAPPERS = Registry( + 'optim_wrapper', + parent=MMENGINE_OPTIM_WRAPPERS, + locations=['mmaction.engine.optimizers']) # manage constructors that customize the optimization hyperparameters. OPTIM_WRAPPER_CONSTRUCTORS = Registry( 'optimizer wrapper constructor', - parent=MMENGINE_OPTIM_WRAPPER_CONSTRUCTORS) + parent=MMENGINE_OPTIM_WRAPPER_CONSTRUCTORS, + locations=['mmaction.engine.optimizers']) # manage all kinds of parameter schedulers like `MultiStepLR` PARAM_SCHEDULERS = Registry( - 'parameter scheduler', parent=MMENGINE_PARAM_SCHEDULERS) + 'parameter scheduler', + parent=MMENGINE_PARAM_SCHEDULERS, + locations=['mmaction.engine']) # manage all kinds of metrics -METRICS = Registry('metric', parent=MMENGINE_METRICS) +METRICS = Registry( + 'metric', parent=MMENGINE_METRICS, locations=['mmaction.evaluation']) # manage evaluator -EVALUATOR = Registry('evaluator', parent=MMENGINE_EVALUATOR) +EVALUATOR = Registry( + 'evaluator', parent=MMENGINE_EVALUATOR, locations=['mmaction.evaluation']) # manage task-specific modules like anchor generators and box coders -TASK_UTILS = Registry('task util', parent=MMENGINE_TASK_UTILS) +TASK_UTILS = Registry( + 'task util', parent=MMENGINE_TASK_UTILS, locations=['mmaction.models']) # manage visualizer -VISUALIZERS = Registry('visualizer', parent=MMENGINE_VISUALIZERS) +VISUALIZERS = Registry( + 'visualizer', + parent=MMENGINE_VISUALIZERS, + locations=['mmaction.visualization']) # manage visualizer backend -VISBACKENDS = Registry('vis_backend', parent=MMENGINE_VISBACKENDS) +VISBACKENDS = Registry( + 'vis_backend', + parent=MMENGINE_VISBACKENDS, + locations=['mmaction.visualization']) # manage logprocessor -LOG_PROCESSORS = Registry('log_processor', parent=MMENGINE_LOG_PROCESSORS) +LOG_PROCESSORS = Registry( + 'log_processor', + parent=MMENGINE_LOG_PROCESSORS, + locations=['mmaction.engine']) + +# manage inferencer +INFERENCERS = Registry( + 'inferencer', + parent=MMENGINE_INFERENCERS, + locations=['mmaction.apis.inferencers']) diff --git a/mmaction/version.py b/mmaction/version.py index be3f0959a7..5a0a756926 100644 --- a/mmaction/version.py +++ b/mmaction/version.py @@ -1,6 +1,6 @@ # Copyright (c) Open-MMLab. All rights reserved. -__version__ = '1.0.0rc2' +__version__ = '1.0.0rc3' def parse_version_info(version_str: str): diff --git a/mmaction/visualization/action_visualizer.py b/mmaction/visualization/action_visualizer.py index fba9d6c600..48c595fd5b 100644 --- a/mmaction/visualization/action_visualizer.py +++ b/mmaction/visualization/action_visualizer.py @@ -1,13 +1,11 @@ # Copyright (c) OpenMMLab. All rights reserved. 
-import os import os.path as osp -import warnings from typing import Dict, List, Optional, Sequence, Tuple, Union -import matplotlib.pyplot as plt import mmcv import numpy as np from mmengine.dist import master_only +from mmengine.fileio.io import isdir, isfile, join_path, list_dir_or_file from mmengine.visualization import Visualizer from mmaction.registry import VISBACKENDS, VISUALIZERS @@ -45,11 +43,6 @@ class ActionVisualizer(Visualizer): Args: name (str): Name of the instance. Defaults to 'visualizer'. - video (Union[np.ndarray, Sequence[np.ndarray]]): - the origin video to draw. The format should be RGB. - For np.ndarray input, the video shape should be (N, H, W, C). - For Sequence[np.ndarray] input, the shape of each frame in - the sequence should be (H, W, C). vis_backends (list, optional): Visual backend config list. Defaults to None. save_dir (str, optional): Save file dir for all storage backends. @@ -89,65 +82,65 @@ class ActionVisualizer(Visualizer): def __init__( self, name='visualizer', - video: Optional[np.ndarray] = None, vis_backends: Optional[List[Dict]] = None, save_dir: Optional[str] = None, fig_save_cfg=dict(frameon=False), - fig_show_cfg=dict(frameon=False, num='show') + fig_show_cfg=dict(frameon=False) ) -> None: - self._dataset_meta = None - self._vis_backends = dict() - - if save_dir is None: - warnings.warn('`Visualizer` backend is not initialized ' - 'because save_dir is None.') - elif vis_backends is not None: - assert len(vis_backends) > 0, 'empty list' - names = [ - vis_backend.get('name', None) for vis_backend in vis_backends - ] - if None in names: - if len(set(names)) > 1: - raise RuntimeError( - 'If one of them has a name attribute, ' - 'all backends must use the name attribute') - else: - type_names = [ - vis_backend['type'] for vis_backend in vis_backends - ] - if len(set(type_names)) != len(type_names): - raise RuntimeError( - 'The same vis backend cannot exist in ' - '`vis_backend` config. ' - 'Please specify the name field.') - - if None not in names and len(set(names)) != len(names): - raise RuntimeError('The name fields cannot be the same') + super().__init__( + name=name, + image=None, + vis_backends=vis_backends, + save_dir=save_dir, + fig_save_cfg=fig_save_cfg, + fig_show_cfg=fig_show_cfg) + + def _load_video(self, + video: Union[np.ndarray, Sequence[np.ndarray], str], + target_resolution: Optional[Tuple[int]] = None): + """Load video from multiple source and convert to target resolution. - save_dir = osp.join(save_dir, 'vis_data') - - for vis_backend in vis_backends: - name = vis_backend.pop('name', vis_backend['type']) - vis_backend.setdefault('save_dir', save_dir) - self._vis_backends[name] = VISBACKENDS.build(vis_backend) - - self.is_inline = 'inline' in plt.get_backend() + Args: + video (np.ndarray, str): The video to draw. + target_resolution (Tuple[int], optional): Set to + (desired_width desired_height) to have resized frames. If + either dimension is None, the frames are resized by keeping + the existing aspect ratio. Defaults to None. 
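+
+        Returns:
+            Union[np.ndarray, Sequence[np.ndarray]]: The loaded frames,
+            resized to ``target_resolution`` when it is given.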
+ """ + if isinstance(video, np.ndarray) or isinstance(video, list): + frames = video + elif isinstance(video, str): + # video file path + if isfile(video): + try: + import decord + except ImportError: + raise ImportError( + 'Please install decord to load video file.') + video = decord.VideoReader(video) + frames = [x.asnumpy()[..., ::-1] for x in video] + # rawframes folder path + elif isdir(video): + frame_list = sorted(list_dir_or_file(video, list_dir=False)) + frames = [mmcv.imread(join_path(video, x)) for x in frame_list] + else: + raise TypeError(f'type of video {type(video)} not supported') - self.fig_save = None - self.fig_show = None - self.fig_save_num = fig_save_cfg.get('num', None) - self.fig_show_num = fig_show_cfg.get('num', None) - self.fig_save_cfg = fig_save_cfg - self.fig_show_cfg = fig_show_cfg + if target_resolution is not None: + w, h = target_resolution + frame_h, frame_w, _ = frames[0].shape + if w == -1: + w = int(h / frame_h * frame_w) + if h == -1: + h = int(w / frame_w * frame_h) + frames = [mmcv.imresize(f, (w, h)) for f in frames] - (self.fig_save_canvas, self.fig_save, - self.ax_save) = self._initialize_fig(fig_save_cfg) - self.dpi = self.fig_save.get_dpi() + return frames @master_only def add_datasample(self, name: str, - video: Union[np.ndarray, Sequence[np.ndarray]], + video: Union[np.ndarray, Sequence[np.ndarray], str], data_sample: Optional[ActionDataSample] = None, draw_gt: bool = True, draw_pred: bool = True, @@ -156,18 +149,22 @@ def add_datasample(self, show_frames: bool = False, text_cfg: dict = dict(), wait_time: float = 0.1, - out_folder: Optional[str] = None, - step: int = 0) -> None: + out_path: Optional[str] = None, + out_type: str = 'img', + target_resolution: Optional[Tuple[int]] = None, + step: int = 0, + fps: int = 4) -> None: """Draw datasample and save to all backends. - - If ``out_folder`` is specified, all storage backends are ignored - and save the videos to the ``out_folder``. + - If ``out_path`` is specified, all storage backends are ignored + and save the videos to the ``out_path``. - If ``show_frames`` is True, plot the frames in a window sequentially, please confirm you are able to access the graphical interface. Args: name (str): The frame identifier. - video (np.ndarray): The video to draw. + video (np.ndarray, str): The video to draw. supports decoded + np.ndarray, video file path, rawframes folder path. data_sample (:obj:`ActionDataSample`, optional): The annotation of the frame. Defaults to None. draw_gt (bool): Whether to draw ground truth labels. @@ -185,14 +182,21 @@ def add_datasample(self, Defaults to an empty dict. wait_time (float): Delay in seconds. 0 is the special value that means "forever". Defaults to 0.1. - out_folder (str, optional): Extra folder to save the visualization + out_path (str, optional): Extra folder to save the visualization result. If specified, the visualizer will only save the result - frame to the out_folder and ignore its storage backends. + frame to the out_path and ignore its storage backends. Defaults to None. + out_type (str): Output format type, choose from 'img', 'gif', + 'video'. Defaults to ``'img'``. + target_resolution (Tuple[int], optional): Set to + (desired_width desired_height) to have resized frames. If + either dimension is None, the frames are resized by keeping + the existing aspect ratio. Defaults to None. step (int): Global step value to record. Defaults to 0. + fps (int): Frames per second for saving video. Defaults to 4. 
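+
+        Example:
+            A usage sketch; the video path and output path are illustrative
+            assumptions, and the ``'gif'`` output additionally requires
+            ``moviepy``:
+
+            >>> vis = ActionVisualizer()
+            >>> vis.add_datasample('demo', 'demo/demo.mp4',
+            ...                    out_path='demo_out.gif', out_type='gif')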
""" classes = None - wait_time_in_milliseconds = wait_time * 10**6 + video = self._load_video(video, target_resolution) tol_video = len(video) if self.dataset_meta is not None: @@ -256,24 +260,41 @@ def add_datasample(self, drawn_img = self.get_image() resulted_video.append(drawn_img) - if show_frames: - self.show( - drawn_img, - win_name=frame_name, - wait_time=wait_time_in_milliseconds) + if show_frames: + frame_wait_time = 1. / fps + for frame_idx, drawn_img in enumerate(resulted_video): + frame_name = 'frame %d of %s' % (frame_idx + 1, name) + if frame_idx < len(resulted_video) - 1: + wait_time = frame_wait_time + else: + wait_time = wait_time + self.show(drawn_img, win_name=frame_name, wait_time=wait_time) resulted_video = np.array(resulted_video) - if out_folder is not None: - resulted_video = resulted_video[..., ::-1] - os.makedirs(out_folder, exist_ok=True) - # save the frame to the target file instead of vis_backends - for frame_idx, frame in enumerate(resulted_video): - mmcv.imwrite(frame, out_folder + '/%d.png' % frame_idx) + if out_path is not None: + save_dir, save_name = osp.split(out_path) + vis_backend_cfg = dict(type='LocalVisBackend', save_dir=save_dir) + tmp_local_vis_backend = VISBACKENDS.build(vis_backend_cfg) + tmp_local_vis_backend.add_video( + save_name, + resulted_video, + step=step, + fps=fps, + out_type=out_type) else: - self.add_video(name, resulted_video, step=step) + self.add_video( + name, resulted_video, step=step, fps=fps, out_type=out_type) + return resulted_video @master_only - def add_video(self, name: str, image: np.ndarray, step: int = 0) -> None: + def add_video( + self, + name: str, + image: np.ndarray, + step: int = 0, + fps: int = 4, + out_type: str = 'img', + ) -> None: """Record the image. Args: @@ -281,6 +302,11 @@ def add_video(self, name: str, image: np.ndarray, step: int = 0) -> None: image (np.ndarray, optional): The image to be saved. The format should be RGB. Default to None. step (int): Global step value to record. Default to 0. + fps (int): Frames per second for saving video. Defaults to 4. + out_type (str): Output format type, choose from 'img', 'gif', + 'video'. Defaults to ``'img'``. """ for vis_backend in self._vis_backends.values(): - vis_backend.add_video(name, image, step) # type: ignore + vis_backend.add_video( + name, image, step=step, fps=fps, + out_type=out_type) # type: ignore diff --git a/mmaction/visualization/video_backend.py b/mmaction/visualization/video_backend.py index 9b6366650e..c32ee8988e 100644 --- a/mmaction/visualization/video_backend.py +++ b/mmaction/visualization/video_backend.py @@ -1,6 +1,7 @@ # Copyright (c) OpenMMLab. All rights reserved. import os import os.path as osp +from typing import Optional import cv2 import numpy as np @@ -21,41 +22,15 @@ class LocalVisBackend(LocalVisBackend): """Local visualization backend class with video support. See mmengine.visualization.LocalVisBackend for more details. - - Args: - save_dir (str, optional): The root directory to save the files - produced by the visualizer. If it is none, it means no data - is stored. - img_save_dir (str): The directory to save images. - Defaults to ``'vis_image'``. - config_save_file (str): The file name to save config. - Defaults to ``'config.py'``. - scalar_save_file (str): The file name to save scalar values. - Defaults to ``'scalars.json'``. - out_type (str): Output format type, choose from 'img', 'gif', - 'video'. Defaults to ``'img'``. - fps (int): Frames per second for saving video. Defaults to 5. 
""" - def __init__( - self, - save_dir: str, - img_save_dir: str = 'vis_image', - config_save_file: str = 'config.py', - scalar_save_file: str = 'scalars.json', - out_type: str = 'img', - fps: int = 5, - ): - super().__init__(save_dir, img_save_dir, config_save_file, - scalar_save_file) - self.out_type = out_type - self.fps = fps - @force_init_env def add_video(self, name: str, frames: np.ndarray, step: int = 0, + fps: Optional[int] = 4, + out_type: Optional[int] = 'img', **kwargs) -> None: """Record the frames of a video to disk. @@ -64,10 +39,13 @@ def add_video(self, frames (np.ndarray): The frames to be saved. The format should be RGB. The shape should be (T, H, W, C). step (int): Global step value to record. Defaults to 0. + out_type (str): Output format type, choose from 'img', 'gif', + 'video'. Defaults to ``'img'``. + fps (int): Frames per second for saving video. Defaults to 4. """ assert frames.dtype == np.uint8 - if self.out_type == 'img': + if out_type == 'img': frames_dir = osp.join(self._save_dir, name, f'frames_{step}') os.makedirs(frames_dir, exist_ok=True) for idx, frame in enumerate(frames): @@ -82,12 +60,12 @@ def add_video(self, 'output file.') frames = [x[..., ::-1] for x in frames] - video_clips = ImageSequenceClip(frames, fps=self.fps) + video_clips = ImageSequenceClip(frames, fps=fps) name = osp.splitext(name)[0] - if self.out_type == 'gif': + if out_type == 'gif': out_path = osp.join(self._save_dir, name + '.gif') video_clips.write_gif(out_path, logger=None) - elif self.out_type == 'video': + elif out_type == 'video': out_path = osp.join(self._save_dir, name + '.mp4') video_clips.write_videofile( out_path, remove_temp=True, logger=None) diff --git a/requirements/mminstall.txt b/requirements/mminstall.txt index cc624e4490..8381c8c000 100644 --- a/requirements/mminstall.txt +++ b/requirements/mminstall.txt @@ -1,2 +1,2 @@ mmcv>=2.0.0rc0,<2.1.0 -mmengine>=0.3.0 +mmengine>=0.5.0,<1.0.0 diff --git a/tests/apis/test_inference.py b/tests/apis/test_inference.py index 883132aec5..6ecd89cb67 100644 --- a/tests/apis/test_inference.py +++ b/tests/apis/test_inference.py @@ -8,14 +8,10 @@ from mmaction.apis import inference_recognizer, init_recognizer from mmaction.structures import ActionDataSample -from mmaction.utils import register_all_modules class TestInference(TestCase): - def setUp(self): - register_all_modules() - @parameterized.expand([(('configs/recognition/tsn/' 'tsn_imagenet-pretrained-r50_8xb32-' '1x1x3-100e_kinetics400-rgb.py'), ('cpu', 'cuda')) diff --git a/tests/apis/test_inferencer.py b/tests/apis/test_inferencer.py new file mode 100644 index 0000000000..58270975fe --- /dev/null +++ b/tests/apis/test_inferencer.py @@ -0,0 +1,65 @@ +# Copyright (c) OpenMMLab. All rights reserved. 
+import os.path as osp
+from tempfile import TemporaryDirectory
+from unittest import TestCase
+
+import torch
+from parameterized import parameterized
+
+from mmaction.apis import MMAction2Inferencer
+
+
+class TestMMActionInferencer(TestCase):
+
+    def test_init_recognizer(self):
+        # Initialized by alias
+        _ = MMAction2Inferencer(rec='tsn')
+
+        # Initialized by config
+        _ = MMAction2Inferencer(
+            rec='tsn_imagenet-pretrained-r50_8xb32-1x1x8-100e_kinetics400-rgb'
+        )  # noqa: E501
+
+        with self.assertRaisesRegex(ValueError,
+                                    'rec algorithm should provided.'):
+            _ = MMAction2Inferencer()
+
+    @parameterized.expand([
+        (('tsn'), ('tools/data/kinetics/label_map_k400.txt'),
+         ('demo/demo.mp4'), ('cpu', 'cuda'))
+    ])
+    def test_infer_recognizer(self, config, label_file, video_path, devices):
+        with TemporaryDirectory() as tmp_dir:
+            for device in devices:
+                if device == 'cuda' and not torch.cuda.is_available():
+                    # Skip the test if cuda is required but unavailable
+                    continue
+
+                # test video file input and return datasample
+                inferencer = MMAction2Inferencer(
+                    config, label_file=label_file, device=device)
+                results = inferencer(video_path, vid_out_dir=tmp_dir)
+                self.assertIn('predictions', results)
+                self.assertIn('visualization', results)
+                assert osp.exists(osp.join(tmp_dir, osp.basename(video_path)))
+
+                results = inferencer(
+                    video_path, vid_out_dir=tmp_dir, out_type='gif')
+                self.assertIsInstance(results['predictions'][0], dict)
+                assert osp.exists(
+                    osp.join(tmp_dir,
+                             osp.basename(video_path).replace('mp4', 'gif')))
+
+                # test np.ndarray input
+                inferencer = MMAction2Inferencer(
+                    config,
+                    label_file=label_file,
+                    device=device,
+                    input_format='array')
+                import decord
+                import numpy as np
+                video = decord.VideoReader(video_path)
+                frames = [x.asnumpy()[..., ::-1] for x in video]
+                frames = np.stack(frames)
+                inferencer(frames, vid_out_dir=tmp_dir)
+                assert osp.exists(osp.join(tmp_dir, '00000000.mp4'))
diff --git a/tests/models/backbones/test_uniformer.py b/tests/models/backbones/test_uniformer.py
new file mode 100644
index 0000000000..4f47cf5728
--- /dev/null
+++ b/tests/models/backbones/test_uniformer.py
@@ -0,0 +1,21 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import torch
+
+from mmaction.models import UniFormer
+from mmaction.testing import generate_backbone_demo_inputs
+
+
+def test_uniformer_backbone():
+    """Test uniformer backbone."""
+    input_shape = (1, 3, 16, 64, 64)
+    imgs = generate_backbone_demo_inputs(input_shape)
+
+    model = UniFormer(
+        depth=[3, 4, 8, 3],
+        embed_dim=[64, 128, 320, 512],
+        head_dim=64,
+        drop_path_rate=0.1)
+    model.init_weights()
+
+    model.eval()
+    assert model(imgs).shape == torch.Size([1, 512, 8, 2, 2])
diff --git a/tests/models/backbones/test_uniformerv2.py b/tests/models/backbones/test_uniformerv2.py
new file mode 100644
index 0000000000..3345892eb7
--- /dev/null
+++ b/tests/models/backbones/test_uniformerv2.py
@@ -0,0 +1,63 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import torch + +from mmaction.models import UniFormerV2 +from mmaction.testing import generate_backbone_demo_inputs + + +def test_uniformerv2_backbone(): + """Test uniformer backbone.""" + input_shape = (1, 3, 8, 64, 64) + imgs = generate_backbone_demo_inputs(input_shape) + + model = UniFormerV2( + input_resolution=64, + patch_size=16, + width=768, + layers=12, + heads=12, + t_size=8, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=False, + no_lmhra=True, + double_lmhra=True, + return_list=[8, 9, 10, 11], + n_layers=4, + n_dim=768, + n_head=12, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]) + model.init_weights() + + model.eval() + assert model(imgs).shape == torch.Size([1, 768]) + + # SthSth + input_shape = (1, 3, 16, 64, 64) + imgs = generate_backbone_demo_inputs(input_shape) + + model = UniFormerV2( + input_resolution=64, + patch_size=16, + width=768, + layers=12, + heads=12, + t_size=16, + dw_reduction=1.5, + backbone_drop_path_rate=0., + temporal_downsample=True, + no_lmhra=False, + double_lmhra=True, + return_list=[8, 9, 10, 11], + n_layers=4, + n_dim=768, + n_head=12, + mlp_factor=4., + drop_path_rate=0., + mlp_dropout=[0.5, 0.5, 0.5, 0.5]) + model.init_weights() + + model.eval() + assert model(imgs).shape == torch.Size([1, 768]) diff --git a/tests/visualization/test_video_backend.py b/tests/visualization/test_video_backend.py index 5f75377b83..0de82465ee 100644 --- a/tests/visualization/test_video_backend.py +++ b/tests/visualization/test_video_backend.py @@ -1,14 +1,20 @@ # Copyright (c) OpenMMLab. All rights reserved. import os +import os.path as osp +import time from pathlib import Path +from tempfile import TemporaryDirectory import decord import torch from mmengine.structures import LabelData from mmaction.structures import ActionDataSample +from mmaction.utils import register_all_modules from mmaction.visualization import ActionVisualizer +register_all_modules() + def test_local_visbackend(): video = decord.VideoReader('./demo/demo.mp4') @@ -16,18 +22,18 @@ def test_local_visbackend(): data_sample = ActionDataSample() data_sample.gt_labels = LabelData(item=torch.tensor([2])) - - vis = ActionVisualizer( - save_dir='./outputs', vis_backends=[dict(type='LocalVisBackend')]) - vis.add_datasample('demo', video, data_sample) - for k in range(32): - frame_path = 'outputs/vis_data/demo/frames_0/%d.png' % k - assert Path(frame_path).exists() - - vis.add_datasample('demo', video, data_sample, step=1) - for k in range(32): - frame_path = 'outputs/vis_data/demo/frames_1/%d.png' % k - assert Path(frame_path).exists() + with TemporaryDirectory() as tmp_dir: + vis = ActionVisualizer( + save_dir=tmp_dir, vis_backends=[dict(type='LocalVisBackend')]) + vis.add_datasample('demo', video, data_sample) + for k in range(32): + frame_path = osp.join(tmp_dir, 'vis_data/demo/frames_0/%d.png' % k) + assert Path(frame_path).exists() + + vis.add_datasample('demo', video, data_sample, step=1) + for k in range(32): + frame_path = osp.join(tmp_dir, 'vis_data/demo/frames_1/%d.png' % k) + assert Path(frame_path).exists() return @@ -37,19 +43,21 @@ def test_tensorboard_visbackend(): data_sample = ActionDataSample() data_sample.gt_labels = LabelData(item=torch.tensor([2])) - - vis = ActionVisualizer( - save_dir='./outputs', - vis_backends=[dict(type='TensorboardVisBackend')]) - vis.add_datasample('demo', video, data_sample, step=1) - - assert Path('outputs/vis_data/').exists() - flag = False - for item in os.listdir('outputs/vis_data/'): - if 
item.startswith('events.out.tfevents.'): - flag = True - break - assert flag, 'Cannot find tensorboard file!' + with TemporaryDirectory() as tmp_dir: + vis = ActionVisualizer( + save_dir=tmp_dir, + vis_backends=[dict(type='TensorboardVisBackend')]) + vis.add_datasample('demo', video, data_sample, step=1) + + assert Path(osp.join(tmp_dir, 'vis_data')).exists() + flag = False + for item in os.listdir(osp.join(tmp_dir, 'vis_data')): + if item.startswith('events.out.tfevents.'): + flag = True + break + assert flag, 'Cannot find tensorboard file!' + # wait tensorboard store asynchronously + time.sleep(1) return diff --git a/tools/analysis_tools/check_videos.py b/tools/analysis_tools/check_videos.py index 87e26980e9..2bd0a9ffca 100644 --- a/tools/analysis_tools/check_videos.py +++ b/tools/analysis_tools/check_videos.py @@ -7,11 +7,9 @@ import numpy as np from mmengine import Config, DictAction, track_parallel_progress +from mmengine.registry import init_default_scope from mmaction.registry import DATASETS, TRANSFORMS -from mmaction.utils import register_all_modules - -register_all_modules() def parse_args(): @@ -115,6 +113,7 @@ def _do_check_videos(lock, pipeline, output_file, data_info): # read config file cfg = Config.fromfile(args.config) cfg.merge_from_dict(args.cfg_options) + init_default_scope(cfg.get('default_scope', 'mmaction')) # build dataset dataset_cfg = cfg.get(f'{args.split}_dataloader').dataset diff --git a/tools/analysis_tools/eval_metric.py b/tools/analysis_tools/eval_metric.py index 1a110154c5..08b6da31e2 100644 --- a/tools/analysis_tools/eval_metric.py +++ b/tools/analysis_tools/eval_metric.py @@ -4,8 +4,7 @@ import mmengine from mmengine import Config, DictAction from mmengine.evaluator import Evaluator - -from mmaction.utils import register_all_modules +from mmengine.registry import init_default_scope def parse_args(): @@ -30,12 +29,11 @@ def parse_args(): def main(): args = parse_args() - register_all_modules() - # load config cfg = Config.fromfile(args.config) if args.cfg_options is not None: cfg.merge_from_dict(args.cfg_options) + init_default_scope(cfg.get('default_scope', 'mmaction')) data_samples = mmengine.load(args.pkl_results) diff --git a/tools/analysis_tools/get_flops.py b/tools/analysis_tools/get_flops.py index 6f5a6454db..b89f5db5ad 100644 --- a/tools/analysis_tools/get_flops.py +++ b/tools/analysis_tools/get_flops.py @@ -12,9 +12,9 @@ 'to set up the environment') from fvcore.nn.print_model_statistics import _format_size from mmengine import Config +from mmengine.registry import init_default_scope from mmaction.registry import MODELS -from mmaction.utils import register_all_modules def parse_args(): @@ -48,8 +48,8 @@ def main(): raise ValueError('invalid input shape') cfg = Config.fromfile(args.config) + init_default_scope(cfg.get('default_scope', 'mmaction')) - register_all_modules() model = MODELS.build(cfg.model) model.eval() diff --git a/tools/deployment/export_onnx_stdet.py b/tools/deployment/export_onnx_stdet.py index 39a3b3ead4..fc587dbff0 100644 --- a/tools/deployment/export_onnx_stdet.py +++ b/tools/deployment/export_onnx_stdet.py @@ -6,10 +6,10 @@ import torch.nn as nn from mmdet.structures.bbox import bbox2roi from mmengine import Config +from mmengine.registry import init_default_scope from mmengine.runner import load_checkpoint from mmaction.registry import MODELS -from mmaction.utils import register_all_modules def parse_args(): @@ -124,8 +124,8 @@ def forward(self, input_tensor, rois): def main(): args = parse_args() - register_all_modules() 
config = Config.fromfile(args.config) + init_default_scope(config.get('default_scope', 'mmaction')) base_model = MODELS.build(config.model) load_checkpoint(base_model, args.checkpoint, map_location='cpu') diff --git a/tools/test.py b/tools/test.py index 341bf9f2c8..0d0d4bd20f 100644 --- a/tools/test.py +++ b/tools/test.py @@ -6,8 +6,6 @@ from mmengine.config import Config, DictAction from mmengine.runner import Runner -from mmaction.utils import register_all_modules - def parse_args(): parser = argparse.ArgumentParser( @@ -91,10 +89,6 @@ def merge_args(cfg, args): def main(): args = parse_args() - # register all modules in mmaction2 into the registries - # do not init the default scope here because it will be init in the runner - register_all_modules(init_default_scope=False) - # load config cfg = Config.fromfile(args.config) cfg = merge_args(cfg, args) diff --git a/tools/train.py b/tools/train.py index e424a7a634..2c51c50709 100644 --- a/tools/train.py +++ b/tools/train.py @@ -6,8 +6,6 @@ from mmengine.config import Config, DictAction from mmengine.runner import Runner -from mmaction.utils import register_all_modules - def parse_args(): parser = argparse.ArgumentParser(description='Train a action recognizer') @@ -121,10 +119,6 @@ def merge_args(cfg, args): def main(): args = parse_args() - # register all modules in mmaction2 into the registries - # do not init the default scope here because it will be init in the runner - register_all_modules(init_default_scope=False) - cfg = Config.fromfile(args.config) # merge cli arguments to config diff --git a/tools/visualizations/browse_dataset.py b/tools/visualizations/browse_dataset.py index 5247db19c2..e6cf9b82c4 100644 --- a/tools/visualizations/browse_dataset.py +++ b/tools/visualizations/browse_dataset.py @@ -9,11 +9,11 @@ import numpy as np from mmengine.config import Config, DictAction from mmengine.dataset import Compose +from mmengine.registry import init_default_scope from mmengine.utils import ProgressBar from mmengine.visualization import Visualizer from mmaction.registry import DATASETS -from mmaction.utils import register_all_modules from mmaction.visualization import ActionVisualizer from mmaction.visualization.action_visualizer import _get_adaptive_scale @@ -178,9 +178,7 @@ def main(): cfg = Config.fromfile(args.config) if args.cfg_options is not None: cfg.merge_from_dict(args.cfg_options) - - # register all modules in mmaction2 into the registries - register_all_modules() + init_default_scope(cfg.get('default_scope', 'mmaction')) dataset_cfg = cfg.get(args.phase + '_dataloader').get('dataset') dataset = DATASETS.build(dataset_cfg) @@ -190,13 +188,10 @@ def main(): intermediate_imgs) # init visualizer - vis_backends = [ - dict( - type='LocalVisBackend', - out_type='video', - save_dir=args.output_dir, - fps=args.fps) - ] + vis_backends = [dict( + type='LocalVisBackend', + save_dir=args.output_dir, + )] visualizer = ActionVisualizer( vis_backends=vis_backends, save_dir='place_holder') @@ -233,7 +228,8 @@ def main(): file_id = f'video_{i}' video = [x[..., ::-1] for x in video] - visualizer.add_datasample(file_id, video, data_sample) + visualizer.add_datasample( + file_id, video, data_sample, fps=args.fps, out_type='video') progress_bar.update() diff --git a/tools/visualizations/vis_cam.py b/tools/visualizations/vis_cam.py index f816cce922..7d158cca80 100644 --- a/tools/visualizations/vis_cam.py +++ b/tools/visualizations/vis_cam.py @@ -11,7 +11,7 @@ from mmengine.dataset import Compose, pseudo_collate from mmaction.apis import 
init_recognizer -from mmaction.utils import GradCAM, register_all_modules +from mmaction.utils import GradCAM def parse_args(): @@ -167,9 +167,6 @@ def _resize_frames(frame_list: List[np.ndarray], def main(): args = parse_args() - # Register all modules in mmaction2 into the registries - register_all_modules() - cfg = Config.fromfile(args.config) cfg.merge_from_dict(args.cfg_options) diff --git a/tools/visualizations/vis_scheduler.py b/tools/visualizations/vis_scheduler.py index 0d50c5191e..6e1b744862 100644 --- a/tools/visualizations/vis_scheduler.py +++ b/tools/visualizations/vis_scheduler.py @@ -12,12 +12,11 @@ from mmengine.config import Config, DictAction from mmengine.hooks import Hook from mmengine.model import BaseModel +from mmengine.registry import init_default_scope from mmengine.runner import Runner from mmengine.visualization import Visualizer from rich.progress import BarColumn, MofNCompleteColumn, Progress, TextColumn -from mmaction.utils import register_all_modules - class SimpleModel(BaseModel): """simple model that do nothing in train_step.""" @@ -206,8 +205,7 @@ def main(): osp.splitext(osp.basename(args.config))[0]) cfg.log_level = args.log_level - # register all modules in mmcls into the registries - register_all_modules() + init_default_scope(cfg.get('default_scope', 'mmaction')) # make sure save_root exists if args.save_path and not args.save_path.parent.exists():
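
The reworked `ActionVisualizer.add_datasample` in this patch accepts a decoded array, a video file, or a rawframes folder, and selects the output format per call. A minimal usage sketch under those assumptions; the save directory and the label index are placeholders, and `demo/demo.mp4` is the demo clip used by the tests above:

```python
import torch
from mmengine.structures import LabelData

from mmaction.structures import ActionDataSample
from mmaction.utils import register_all_modules
from mmaction.visualization import ActionVisualizer

register_all_modules()

data_sample = ActionDataSample()
data_sample.gt_labels = LabelData(item=torch.tensor([2]))  # placeholder label

vis = ActionVisualizer(
    save_dir='work_dirs/vis',  # hypothetical output directory
    vis_backends=[dict(type='LocalVisBackend')])

# The video file is decoded by _load_video; frames are resized to width 256
# while keeping the aspect ratio (height passed as -1). With out_path set,
# the storage backends are bypassed and a GIF is written to that path.
vis.add_datasample(
    'demo',
    'demo/demo.mp4',
    data_sample,
    target_resolution=(256, -1),
    out_path='work_dirs/vis/demo.gif',
    out_type='gif',
    fps=4)
```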
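Correspondingly, the patched `LocalVisBackend.add_video` takes `fps` and `out_type` per call instead of as constructor options. A rough sketch, assuming `LocalVisBackend` is exported from `mmaction.visualization` and that moviepy is installed for the 'gif'/'video' modes:

```python
import numpy as np

from mmaction.visualization import LocalVisBackend

backend = LocalVisBackend(save_dir='work_dirs/vis')  # hypothetical directory

# Frames must be uint8 RGB with shape (T, H, W, C).
frames = np.random.randint(0, 256, size=(16, 224, 224, 3), dtype=np.uint8)

# 'img' dumps per-frame PNGs under frames_{step}; 'gif' and 'video' are
# written through moviepy's ImageSequenceClip.
backend.add_video('random_clip', frames, step=0, fps=4, out_type='video')
```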
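The new `MMAction2Inferencer` exercised by `tests/apis/test_inferencer.py` can also be used directly. A sketch mirroring that test: the label map and demo video paths are the ones referenced there, the output directory is a placeholder, and the checkpoint is assumed to be resolved automatically (the test constructs the inferencer without one):

```python
from mmaction.apis import MMAction2Inferencer

# 'tsn' is a registered model alias; a full config name also works.
inferencer = MMAction2Inferencer(
    rec='tsn',
    label_file='tools/data/kinetics/label_map_k400.txt',
    device='cpu')

results = inferencer(
    'demo/demo.mp4', vid_out_dir='work_dirs/vis', out_type='gif')
print(results['predictions'][0])  # per-video prediction dict
# results['visualization'] holds the drawn frames
```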
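The UniFormer backbone tests added above reduce to a plain forward pass; a minimal standalone sketch of the same check, with a random clip in place of `generate_backbone_demo_inputs`:

```python
import torch

from mmaction.models import UniFormer

model = UniFormer(
    depth=[3, 4, 8, 3],
    embed_dim=[64, 128, 320, 512],
    head_dim=64,
    drop_path_rate=0.1)
model.init_weights()
model.eval()

# (N, C, T, H, W) clip, matching the shape used in test_uniformer.py.
imgs = torch.randn(1, 3, 16, 64, 64)
with torch.no_grad():
    feat = model(imgs)
assert feat.shape == torch.Size([1, 512, 8, 2, 2])
```

The UniFormerV2 test follows the same pattern with an 8-frame (Kinetics-style) and a 16-frame (Something-Something-style) configuration, each producing a `(1, 768)` feature.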
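Finally, the tool scripts in this patch switch from `register_all_modules()` to mmengine's `init_default_scope`. For a downstream script the pattern looks roughly like this (the config path is the TSN config referenced in `tests/apis/test_inference.py` above):

```python
from mmengine import Config
from mmengine.registry import init_default_scope

from mmaction.registry import MODELS

cfg = Config.fromfile(
    'configs/recognition/tsn/'
    'tsn_imagenet-pretrained-r50_8xb32-1x1x3-100e_kinetics400-rgb.py')

# Setting the default scope lets the mmaction registries resolve the
# config's types without an explicit register_all_modules() call.
init_default_scope(cfg.get('default_scope', 'mmaction'))

model = MODELS.build(cfg.model)
model.eval()
```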