Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval (ACM MM 2024 Poster)
(1) We developed Reversed in Time (RTime), a benchmark specifically designed to evaluate the temporal understanding of video-text retrieval models: a model must distinguish original videos from their temporally reversed counterparts and match each with its corresponding caption (a minimal sketch of this evaluation follows the list).
(2) Our experiments show that current state-of-the-art models perform only at chance level on this temporal-understanding test.
(3) We fine-tuned UMT on RTime's training set to obtain UMT-neg, which improves temporal-understanding accuracy to approximately 55%.
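As a rough illustration of what "chance level" means here, the sketch below (not the official evaluation code) scores a model on the binary choice the benchmark poses: for each caption, the original video should score higher than its reversed copy. The embedding shapes and cosine-similarity scoring are assumptions for illustration.

```python
# Minimal sketch of a reversed-video temporal-understanding probe.
# A temporally blind model should land near 50% on this binary choice.
import numpy as np

def temporal_accuracy(text_emb: np.ndarray,
                      orig_video_emb: np.ndarray,
                      rev_video_emb: np.ndarray) -> float:
    """All inputs are L2-normalized (N, D) matrices, row-aligned:
    text_emb[i] describes orig_video_emb[i]; rev_video_emb[i] is its reversal."""
    sim_orig = np.sum(text_emb * orig_video_emb, axis=1)  # paired cosine similarity
    sim_rev = np.sum(text_emb * rev_video_emb, axis=1)
    return float(np.mean(sim_orig > sim_rev))

# Toy check: random embeddings (no temporal signal) score ~0.5.
rng = np.random.default_rng(0)
t = rng.normal(size=(1000, 64)); t /= np.linalg.norm(t, axis=1, keepdims=True)
v = rng.normal(size=(1000, 64)); v /= np.linalg.norm(v, axis=1, keepdims=True)
r = rng.normal(size=(1000, 64)); r /= np.linalg.norm(r, axis=1, keepdims=True)
print(f"chance-level accuracy: {temporal_accuracy(t, v, r):.2f}")
```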
To avoid potential copyright issues, we only provide the original video URLs; you can download the corresponding videos after obtaining the necessary permission from the source website.
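For convenience, here is a hedged download sketch. The file name `video_urls.txt` and its format (one direct URL per line) are assumptions, not the repo's actual layout; only run it after securing permission as noted above.

```python
# Sketch: bulk-download videos from a plain-text list of direct URLs.
import os
import requests

def download_videos(url_file: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        dest = os.path.join(out_dir, url.rsplit("/", 1)[-1])
        if os.path.exists(dest):  # skip files already fetched
            continue
        resp = requests.get(url, stream=True, timeout=60)
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)

download_videos("video_urls.txt", "videos/")
```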
When evaluating models on the validation and test sets, we do not use the caption rewrites generated by GPT-4. You are welcome to run additional experiments with them if you are interested.
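If you do experiment with the rewrites, one way to toggle them at load time is sketched below. The annotation schema (a JSON list with a per-caption `is_rewrite` flag) is purely hypothetical and will differ from the repo's actual format.

```python
# Hypothetical sketch: drop GPT-4 rewrites when loading evaluation captions.
import json

def load_eval_captions(path: str, use_rewrites: bool = False) -> list[dict]:
    with open(path) as f:
        entries = json.load(f)
    if use_rewrites:
        return entries
    # "is_rewrite" is an assumed field marking GPT-4-generated captions.
    return [e for e in entries if not e.get("is_rewrite", False)]
```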
If you find this repo helpful, please consider citing:
```bibtex
@inproceedings{Du2024RTime,
  author    = {Du, Yang and Liu, Yuqi and Jin, Qin},
  title     = {Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval},
  booktitle = {Proceedings of the 32nd {ACM} International Conference on Multimedia},
  pages     = {5260--5269},
  year      = {2024},
}
```