Hi Authors,

Thank you for sharing your great work. I'm curious about the performance of your models on action recognition tasks. Have you tried benchmarking on any standard action recognition datasets, such as SSv2 or K400/700?

Thank you.
We took TimeSformer's codebase and used our video encoder as the initialization. The result on K400 is about 80%, roughly 2% higher than TimeSformer with the same architecture. Unfortunately, it is still much lower than CLIP's vision encoder (~85%, if I remember correctly). This may indicate that CLIP has stronger visual representations, while our model has better video-text alignment.
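For reference, the initialization step looks roughly like the following. This is a minimal sketch assuming the public facebookresearch/TimeSformer codebase; the checkpoint path and state-dict key layout are hypothetical, not the authors' actual setup.

```python
# Minimal sketch: initialize a TimeSformer backbone from a pretrained video encoder.
# Assumes the public facebookresearch/TimeSformer codebase; the checkpoint path
# and key names below are hypothetical placeholders.
import torch
from timesformer.models.vit import TimeSformer

# Same architecture as used for K400 fine-tuning.
model = TimeSformer(
    img_size=224,
    num_classes=400,               # Kinetics-400
    num_frames=8,
    attention_type='divided_space_time',
)

ckpt = torch.load('our_video_encoder.pth', map_location='cpu')  # hypothetical path
state_dict = ckpt.get('state_dict', ckpt)

# strict=False leaves the randomly initialized classification head untouched
# and reports any keys that do not line up between the two models.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f'{len(missing)} missing keys, {len(unexpected)} unexpected keys')
```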