Question about the Temporal model #6
Hi @ygfrancois, thanks for your questions. In the original CLIP model, the weights are converted to FP16 (https://github.com/openai/CLIP/blob/main/clip/model.py#L434). If we keep this line for pretraining and disable mixed-precision training, the loss becomes NaN at some point. If we remove it and train in FP32 or with mixed precision, the performance on the video retrieval task is very poor. Please let me know if you have any thoughts!
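For context, a minimal sketch of what that conversion does, in the spirit of CLIP's `convert_weights` helper in the linked file (this is a simplified assumption of its behavior, not a copy of it):

```python
import torch
import torch.nn as nn

def convert_weights_to_fp16(model: nn.Module) -> None:
    """Cast the weights of common layer types to FP16 in place,
    roughly mirroring the idea of CLIP's convert_weights."""
    def _convert(m: nn.Module) -> None:
        if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            m.weight.data = m.weight.data.half()
            if m.bias is not None:
                m.bias.data = m.bias.data.half()
    model.apply(_convert)

# Usage: the whole forward pass then runs in half precision, which is
# where the NaN-vs-accuracy trade-off described above comes from.
model = nn.Sequential(nn.Linear(4, 4), nn.Conv2d(3, 3, kernel_size=1))
convert_weights_to_fp16(model)
```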
FP16 and FP32 do show a big difference when using a temperature (the logit scale applied before the softmax). Maybe check the difference between your temperature setting and the pretrained CLIP's?
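To illustrate why the temperature interacts with precision: CLIP scales cosine similarities by `exp(logit_scale)` (initialized to `log(1/0.07)`, i.e. a multiplier of roughly 14.3) before the softmax, so small FP16 rounding errors in the similarities are amplified. A toy sketch (values here are made up for illustration):

```python
import torch

# CLIP-style learnable temperature: similarities multiplied by exp(logit_scale).
logit_scale = torch.tensor(1 / 0.07).log().exp()  # ~14.29

sims = torch.tensor([0.301, 0.300, 0.100])  # hypothetical cosine similarities
logits = logit_scale * sims

probs_fp32 = logits.softmax(dim=-1)
# Emulate FP16: round the scaled logits to half precision, softmax in FP32.
probs_fp16 = logits.half().float().softmax(dim=-1)

print(probs_fp32)
print(probs_fp16)
```

The larger the effective temperature multiplier, the more the rounded FP16 logits diverge from the FP32 ones after the softmax.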
Hi, thanks a lot for sharing your solid work. I have learned a lot from your paper and code. I still have a question about the temporal modeling part.
I saw that you compared the performance of Timesformer and XCLIP, which shows that Timesformer works better. However, the XCLIP paper used pretrained CLIP weights, and XCLIP found a trade-off between preserving the performance of the pretrained CLIP weights and adding temporal modeling.
I want to ask whether you have tested XCLIP with pretrained CLIP weights, and whether you found a way to use both Timesformer's temporal modeling and CLIP's pretrained weights, which I think would beat XCLIP in theory. 😊
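One common way to combine a temporal module with pretrained spatial weights (used e.g. in Timesformer-style adaptations, not necessarily what the authors did) is to zero-initialize the temporal attention's output projection, so the block is exactly an identity wrapper around the pretrained per-frame features at the start of training. A hedged sketch, with all names and shapes being my own assumptions:

```python
import torch
import torch.nn as nn

class TemporalShimBlock(nn.Module):
    """Residual temporal attention over per-frame features.
    The output projection is zero-initialized, so at init the block
    returns its input unchanged and the pretrained behavior is preserved."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        nn.init.zeros_(self.temporal_attn.out_proj.weight)
        nn.init.zeros_(self.temporal_attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) — e.g. CLIP image features, one per frame
        h = self.norm(x)
        t, _ = self.temporal_attn(h, h, h)
        return x + t  # t == 0 at initialization

torch.manual_seed(0)
block = TemporalShimBlock()
x = torch.randn(2, 8, 64)
with torch.no_grad():
    y = block(x)
```

At initialization `y` equals `x`, so loading CLIP weights into the spatial backbone loses nothing; the temporal path only contributes as training updates the zeroed projection.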