MMAction2 V1.0.0 Release
Highlights
We are excited to announce the release of MMAction2 1.0.0 as a part of the OpenMMLab 2.0 project! MMAction2 1.0.0 introduces an updated framework structure for the core package and a new section called Projects. This section showcases various engaging and versatile applications built upon the MMAction2 foundation.
In this latest release, we have significantly refactored the core package's code to make it clearer, more comprehensible, and disentangled. This has resulted in improved performance for several existing algorithms, ensuring that they now outperform their previous versions. Additionally, we have incorporated some cutting-edge algorithms, such as VideoSwin and VideoMAE, to further enhance the capabilities of MMAction2 and provide users with a more comprehensive and powerful toolkit. The new Projects section serves as an essential addition to MMAction2, created to foster innovation and collaboration among users. This section offers the following attractive features:
- Flexible code contribution: Unlike the core package, the Projects section allows for a more flexible environment for code contributions, enabling faster integration of state-of-the-art models and features.
- Showcase of diverse applications: Explore various projects built upon the MMAction2 foundation, such as deployment examples and combinations of video recognition with other tasks.
- Fostering creativity and collaboration: Encourages users to experiment, build upon the MMAction2 platform, and share their innovative applications and techniques, creating an active community of developers and researchers.

Discover the possibilities within the Projects section and join the vibrant MMAction2 community in pushing the boundaries of video understanding applications!
Exciting Features
RGBPoseConv3D
RGBPoseConv3D is a framework that jointly uses 2D human skeletons and RGB appearance for human action recognition. It is a two-stream 3D CNN whose architecture is borrowed from SlowFast. In RGBPoseConv3D (a schematic sketch follows the list below):
- The RGB stream corresponds to the slow stream in SlowFast; the skeleton stream corresponds to the fast stream in SlowFast.
- The input resolution of RGB frames is 4x larger than that of the pseudo heatmaps.
- Bilateral connections are used for early feature fusion between the two modalities.
- Supported by @Dai-Wenxun in #2182
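The snippet below is a minimal schematic sketch of such a two-stream 3D CNN with bilateral lateral connections, written in plain PyTorch. The channel widths, kernel sizes, and the single fusion point are illustrative assumptions; this is not the MMAction2 implementation of RGBPoseConv3D.

```python
# Schematic two-stream 3D CNN with bilateral lateral connections, in the
# spirit of RGBPoseConv3D. Channel widths, kernel sizes and the single
# fusion point are illustrative assumptions, not the MMAction2 code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamSketch(nn.Module):
    def __init__(self, num_joints=17):
        super().__init__()
        # RGB ("slow") stream: high spatial resolution, few frames.
        self.rgb_stem = nn.Conv3d(3, 64, kernel_size=(1, 7, 7),
                                  stride=(1, 2, 2), padding=(0, 3, 3))
        # Pose ("fast") stream: low-resolution pseudo heatmaps, more frames.
        self.pose_stem = nn.Conv3d(num_joints, 32, kernel_size=(3, 7, 7),
                                   stride=1, padding=(1, 3, 3))
        # Bilateral lateral connections for early feature fusion.
        self.rgb_to_pose = nn.Conv3d(64, 32, kernel_size=1)
        self.pose_to_rgb = nn.Conv3d(32, 64, kernel_size=1)

    def forward(self, rgb, heatmap):
        # rgb:     (N, 3, T, H, W); heatmap: (N, V, T', H/4, W/4)
        r = self.rgb_stem(rgb)
        p = self.pose_stem(heatmap)
        # Resize each stream's features to the other's spatio-temporal size
        # and exchange them additively (one possible early-fusion scheme).
        r_fused = r + F.interpolate(self.pose_to_rgb(p), size=r.shape[2:])
        p_fused = p + F.interpolate(self.rgb_to_pose(r), size=p.shape[2:])
        return r_fused, p_fused


# Toy forward pass: 224x224 RGB frames vs. 56x56 pseudo heatmaps (4x smaller).
model = TwoStreamSketch()
r, p = model(torch.randn(1, 3, 8, 224, 224), torch.randn(1, 17, 32, 56, 56))
```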
Inferencer
In this release, we introduce MMAction2Inferencer, a versatile inference API that supports multiple input types. The API enables users to easily specify and customize action recognition models, streamlining the process of performing video prediction with MMAction2.
Usage:
python demo/demo_inferencer.py ${INPUTS} [OPTIONS]
- The INPUTS can be a video path or a rawframes folder. For more detailed information on OPTIONS, please refer to Inferencer.
Example:
python demo/demo_inferencer.py zelda.mp4 --rec tsn --vid-out-dir zelda_out --label-file tools/data/kinetics/label_map_k400.txt
You can find zelda.mp4 here. The output video is displayed below:
[Output video: Clipchamp.mp4]
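The inferencer can also be driven from Python. The snippet below is only a sketch: it assumes MMAction2Inferencer is importable from mmaction.apis and that its keyword arguments mirror the CLI options shown above (rec, label_file, vid_out_dir); check the Inferencer documentation of your installed version for the exact signature.

```python
# Hedged sketch of the Python-level inferencer; the import path and argument
# names are assumed to mirror the CLI options above and may differ by version.
from mmaction.apis import MMAction2Inferencer

inferencer = MMAction2Inferencer(
    rec='tsn',                                            # model alias, as with --rec
    label_file='tools/data/kinetics/label_map_k400.txt',  # as with --label-file
)
# Run recognition on the demo video and dump the visualized result.
results = inferencer('zelda.mp4', vid_out_dir='zelda_out')
print(results)
```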
List of Novel Features
MMAction2 V1.0 introduces support for new models and datasets in the field of video understanding, including MSG3D [Project] (CVPR'2020), CTRGCN [Project] (CVPR'2021), STGCN++ (Arxiv'2022), Video Swin Transformer (CVPR'2022), VideoMAE (NeurIPS'2022), C2D (CVPR'2018), MViT V2 (CVPR'2022), UniFormer V1 (ICLR'2022), and UniFormer V2 (Arxiv'2022), as well as the spatiotemporal action detection dataset AVA-Kinetics (Arxiv'2022).
- Enhanced Omni-Source: We enhanced the original omni-source technique by dynamically adjusting the 3D convolutional network architecture to utilize videos and images simultaneously for training. Taking SlowOnly R50 8x8 as an example, the Top-1 accuracy comparison of the three training methods illustrates that our omni-source training effectively employs the additional ImageNet dataset, significantly boosting performance on Kinetics400 (a training-step sketch follows this list).
- Multi-Stream Skeleton Pipeline: MMAction2 previously supported only the joint and bone modalities; MMAction2 V1.0 extends support to the joint motion and bone motion modalities (see the derivation sketch after this list). Furthermore, we have conducted training and evaluation for these four modalities using NTU60 2D and 3D keypoint data on STGCN, 2s-AGCN, and STGCN++.
- Repeat Augment was initially proposed as a data augmentation method for ImageNet training and has been employed in recent Video Transformer works. Whenever a video is read during training, we take multiple (typically 2-4) random samples from it for training. This approach not only enhances the model's generalization capability but also reduces the IO pressure of video reading. We support Repeat Augment in MMAction2 V1.0 and utilize this technique in MViT V2 training; the table below compares the Top-1 accuracy on Kinetics400 before and after employing Repeat Augment (a minimal sketch also follows this list).
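To make the omni-source idea above more concrete, here is a heavily hedged sketch of one joint training step. The names backbone, video_head, image_head and the batch variables are hypothetical placeholders, not MMAction2 components, and the actual feature adjusts the network architecture rather than only the data, as described above.

```python
# Hedged sketch of one omni-source training step: a video batch and an image
# batch are consumed together, with each image wrapped as a single-frame clip
# so the 3D backbone accepts it. All names are illustrative placeholders.
import torch.nn.functional as F


def omni_source_step(backbone, video_head, image_head,
                     video_batch, image_batch, optimizer):
    videos, video_labels = video_batch      # videos: (N, C, T, H, W)
    images, image_labels = image_batch      # images: (M, C, H, W)

    # Treat each image as a one-frame clip so it can pass through the 3D net.
    pseudo_clips = images.unsqueeze(2)      # (M, C, 1, H, W)

    video_loss = F.cross_entropy(video_head(backbone(videos)), video_labels)
    # A separate head maps image features to the image dataset's label space.
    image_loss = F.cross_entropy(image_head(backbone(pseudo_clips)), image_labels)

    loss = video_loss + image_loss          # relative weighting is a design choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```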
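The four skeleton modalities can be derived from raw joint coordinates as sketched below; the toy parent list and helper function are illustrative and are not part of the MMAction2 pipeline.

```python
# Deriving the joint, bone, joint-motion and bone-motion modalities from raw
# keypoints. The 5-joint parent list is a toy example, not a real skeleton.
import numpy as np

PARENTS = [0, 0, 1, 2, 3]  # parent index of each joint (toy skeleton)


def skeleton_modalities(joints):
    """joints: (T, V, C) keypoints over T frames, V joints, C coordinates."""
    bone = joints - joints[:, PARENTS, :]                        # joint minus its parent
    joint_motion = np.diff(joints, axis=0, prepend=joints[:1])   # frame-to-frame delta
    bone_motion = np.diff(bone, axis=0, prepend=bone[:1])
    return {'joint': joints, 'bone': bone,
            'joint_motion': joint_motion, 'bone_motion': bone_motion}


# Toy usage: 10 frames, 5 joints, 2D coordinates.
mods = skeleton_modalities(np.random.rand(10, 5, 2))
print({k: v.shape for k, v in mods.items()})
```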
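Finally, a minimal sketch of the repeat-augment idea: one video decode yields several independently sampled clips. This is a simplified stand-in, not the MMAction2 transform.

```python
# Repeat augmentation sketch: decode a video once, then draw several random
# temporal crops from it for the same training step.
import random


def repeat_augment(frames, num_repeats=2, clip_len=16):
    """frames: a decoded video as a list of frames (read from disk once)."""
    max_start = max(len(frames) - clip_len, 0)
    clips = []
    for _ in range(num_repeats):
        start = random.randint(0, max_start)   # each repeat gets its own crop
        clips.append(frames[start:start + clip_len])
    return clips                               # all clips share a single decode


# Toy usage: one "decode" of 100 frames produces three training clips.
print([clip[:3] for clip in repeat_augment(list(range(100)), num_repeats=3)])
```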
Bug Fixes
- [Fix] Fix flip config of TSM for sth2sth v1/v2 dataset by @cir7 in #2247
- [Fix] Fix circle ci by @cir7 in #2336 and #2334
- [Fix] Fix accepting an unexpected argument local-rank in PyTorch 2.0 by @cir7 in #2320
- [Fix] Fix TSM config link by @zyx-cv in #2315
- [Fix] Fix numpy version requirement in CI by @hukkai in #2284
- [Fix] Fix NTU pose extraction script by @cir7 in #2246
- [Fix] Fix TSM-MobileNet V2 by @cir7 in #2332
- [Fix] Fix command bugs in localization tasks' README by @hukkai in #2244
- [Fix] Fix duplicate name in DecordInit and SampleAVAFrame by @cir7 in #2251
- [Fix] Fix channel order when showing video by @cir7 in #2308
- [Fix] Specify map_location to cpu when using _load_checkpoint by @Zheng-LinXiao in #2252
New Contributors
- @Andy1621 made their first contribution in #2153
- @zoe08 made their first contribution in #2188
- @vansin made their first contribution in #2228
- @Zheng-LinXiao made their first contribution in #2252
Full Changelog: v0.24.0...v1.0.0