Hello authors,
In your paper, you mention that the answer candidates for the question in the 【fine-grained action】 sub-task are generated using UMT-L. Could you please clarify whether you use a pre-trained UMT-L model to encode the videos and the 339 categories (the total number of categories in the Moments in Time dataset), and then compute the text-visual similarity?
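To make the question concrete, here is a minimal sketch of the pipeline I am assuming: a UMT-L-style dual encoder embeds the video clip and each of the 339 category names into a shared space, and the top-k most similar categories (by cosine similarity) are taken as the candidates. The `encode_video` / `encode_text` functions below are hypothetical placeholders for the real UMT-L encoders, not the actual API, and the category list is truncated for illustration.

```python
import numpy as np

EMBED_DIM = 768  # assumed embedding dimensionality of the shared space

def encode_video(video_path: str) -> np.ndarray:
    # Placeholder for the UMT-L video encoder; a real implementation would
    # load the clip and run the pre-trained model. Here: a deterministic
    # random vector so the sketch is runnable.
    rng = np.random.default_rng(abs(hash(video_path)) % (2**32))
    return rng.standard_normal(EMBED_DIM)

def encode_text(label: str) -> np.ndarray:
    # Placeholder for the UMT-L text encoder applied to a category name.
    rng = np.random.default_rng(abs(hash(label)) % (2**32))
    return rng.standard_normal(EMBED_DIM)

def top_k_candidates(video_path: str, categories: list[str], k: int = 4) -> list[str]:
    """Rank category names by cosine similarity to the video embedding and
    return the k most similar ones as answer candidates."""
    v = encode_video(video_path)
    v = v / np.linalg.norm(v)
    t = np.stack([encode_text(c) for c in categories])
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sims = t @ v                    # cosine similarity per category
    order = np.argsort(-sims)       # highest similarity first
    return [categories[i] for i in order[:k]]

if __name__ == "__main__":
    # Truncated stand-in for the 339 Moments in Time category names.
    cats = ["running", "cooking", "juggling", "welding", "climbing"]
    print(top_k_candidates("example_clip.mp4", cats, k=3))
```

Is this roughly what the paper does, or are the candidates selected in some other way (e.g., from a different candidate pool or with a different similarity measure)?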
Thank you!