This repository contains the PyTorch implementation of the approach described in our report "M³T: Multi-Modal Multi-Task Learning for Continuous Valence-Arousal Estimation", which served as our entry to the ABAW Challenge 2020 (VA track). We provide models trained on Aff-Wild2.
- 2020.02.10: Initial public release
First, install the dependencies:
```bash
# clone the project and install dependencies
git clone https://github.com/sailordiary/m3t.pytorch
cd m3t.pytorch
python3 -m pip install -r requirements.txt --user
```
To evaluate with our pretrained models, first download the checkpoints from the release page, then run eval.py to generate validation or test set predictions:
```bash
# download the checkpoint
wget

# to report CCC on the validation set
python3 eval.py --test_on_val --checkpoint m3t_mtl-vox2.pt
python3 get_smoothed_ccc.py predictions_val.pt

# to generate test set predictions
python3 eval.py --checkpoint m3t_mtl-vox2.pt
```
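For reference, the VA track is scored with the Concordance Correlation Coefficient (CCC), computed separately for valence and arousal. Below is a minimal NumPy sketch of the unsmoothed metric; the get_smoothed_ccc script presumably applies temporal smoothing to the predictions before scoring (as its name suggests), so its output may differ slightly.

```python
import numpy as np

def ccc(pred, gold):
    """Concordance Correlation Coefficient between two 1-D sequences."""
    pred = np.asarray(pred, dtype=np.float64)
    gold = np.asarray(gold, dtype=np.float64)
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(), gold.var()
    cov = np.mean((pred - mp) * (gold - mg))
    return 2.0 * cov / (vp + vg + (mp - mg) ** 2)

# e.g. ccc(valence_pred, valence_gold) and ccc(arousal_pred, arousal_gold)
```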
We use the Aff-Wild2 dataset. The raw videos are decoded with ffmpeg and passed to RetinaFace-ResNet50 for face detection. To extract log-Mel spectrogram energies, extract 16 kHz mono WAV files from the audio tracks and refer to process/extract_melspec.py.
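As a rough sketch of this step (not the exact parameters used in our experiments; those live in process/extract_melspec.py), the snippet below calls ffmpeg to extract a 16 kHz mono WAV track and uses librosa to compute log-Mel energies. The file names and spectrogram settings are placeholders.

```python
import subprocess

import librosa
import numpy as np

video_path, wav_path = "video.mp4", "audio_16k.wav"  # hypothetical paths

# Extract a 16 kHz mono WAV track from the video with ffmpeg.
subprocess.run(
    ["ffmpeg", "-y", "-i", video_path, "-ar", "16000", "-ac", "1", wav_path],
    check=True,
)

# Compute log-Mel spectrogram energies.
# n_fft / hop_length / n_mels are placeholders; see process/extract_melspec.py
# for the settings actually used.
y, sr = librosa.load(wav_path, sr=16000, mono=True)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=640, n_mels=64)
log_mel = librosa.power_to_db(mel)  # (n_mels, n_frames)
np.save("melspec.npy", log_mel.T)   # (n_frames, n_mels)
```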
We provide the cropped-aligned face tracks (256x256, ~79 GB zipped), as well as the pre-computed SENet-101 and TCAE features used in our experiments, here: [OneDrive]
Some files are still being uploaded at the moment; please check the page again later.
Note that in addition to the 256-dimensional encoder features, we also saved 12 AU activation scores predicted by TCAE, which together are concatenated into a 268-dimensional vector for each video frame. We only used the encoder features for our experiments, but feel free to experiment with this extra information.
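For example, assuming a per-video feature array of shape (num_frames, 268) stored as a NumPy file (the exact file format and naming on the share may differ), the two parts can be separated as follows:

```python
import numpy as np

# "video_features.npy" is a hypothetical file name; shape (num_frames, 268)
feats = np.load("video_features.npy")
enc_feats = feats[:, :256]  # 256-d TCAE encoder features (what we used)
au_scores = feats[:, 256:]  # 12 AU activation scores predicted by TCAE
```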
Coming soon...
```bibtex
@misc{zhang2020m3t,
    title={$M^3$T: Multi-Modal Continuous Valence-Arousal Estimation in the Wild},
    author={Yuan-Hang Zhang and Rulin Huang and Jiabei Zeng and Shiguang Shan and Xilin Chen},
    year={2020},
    eprint={2002.02957},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```