
Question about the Temporal model #6

Open
ygfrancois opened this issue Mar 9, 2023 · 2 comments

Comments

@ygfrancois

Hi, thanks a lot for sharing your solid work; I have learned a lot from your paper and code. I still have a question about the temporal modeling part.
I saw that you compared the performance of Timesformer and XCLIP, which shows that Timesformer works better. However, the XCLIP paper used pretrained CLIP weights, and XCLIP found a trade-off between keeping the performance of the pretrained CLIP weights and adding temporal modeling.
I want to ask whether you have tested XCLIP with pretrained CLIP, and whether you found a way to use both Timesformer's temporal modeling and the CLIP pretrained weights, which I think would beat XCLIP in theory. 😊

@klauscc
Owner

klauscc commented Mar 13, 2023

Hi @ygfrancois, thanks for your questions.
Yes, I agree with you. We did try to initialize the visual encoder and text encoder with CLIP's pretrained weights. However, we encountered an engineering problem that we haven't solved yet.

In the original CLIP model, the weights are converted to FP16 (https://github.com/openai/CLIP/blob/main/clip/model.py#L434). If we keep this line for pretraining and disable mixed-precision training, the loss becomes NaN at some point. If we remove it and train in FP32 or with mixed precision, the performance on the video retrieval task is very bad.
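For reference, here is a minimal sketch of the conversion in question, using OpenAI's `clip` package; the `KEEP_FP16` toggle is our own, added only to illustrate the two settings being compared:

```python
import torch
import clip
from clip.model import convert_weights  # the FP16 cast at clip/model.py#L434

KEEP_FP16 = True  # hypothetical toggle for the two settings discussed above

# Loading on CPU keeps the weights in FP32 (clip.load() calls model.float()).
model, preprocess = clip.load("ViT-B/32", device="cpu")

if KEEP_FP16:
    # Casts Linear/Conv/attention weights to FP16 -- exactly what
    # build_model() does internally when the line above is kept.
    convert_weights(model)
# else: leave everything in FP32 (optionally train under torch.cuda.amp)

model = model.to("cuda")
```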

To verify the problem (convert_weights(model)), we also tried to finetune CLIP using CLIP4Clip's codebase with and without this line; the performance with FP32/mixed precision is ~3% worse than with FP16 on the MSR-VTT dataset. We also posted this issue on CLIP4Clip's GitHub (ArrowLuo/CLIP4Clip#96), but there is no response yet.
However, this does not seem to be an issue for video classification work like XCLIP, which removed this line.
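To be concrete, by "mixed precision" we mean the standard PyTorch AMP recipe below; this is a generic sketch, not CLIP4Clip's actual training loop, and `model`, `loader`, and `contrastive_loss` are placeholders:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # loss scaling to avoid FP16 underflow
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for images, texts in loader:  # placeholder dataloader
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # ops run in FP16 where it is safe
        image_feat = model.encode_image(images.cuda())
        text_feat = model.encode_text(texts.cuda())
        loss = contrastive_loss(image_feat, text_feat)  # placeholder loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales grads; skips the step on inf/NaN
    scaler.update()
```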

Please let me know if you have any thoughts!

@ygfrancois
Copy link
Author

FP16 and FP32 do make a big difference when using a temperature (the logit scale applied before the softmax). Maybe check the difference between your temperature setting and pretrained CLIP's?
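For context, this is where the temperature lives in OpenAI's CLIP; the initialization and the ln(100) clamp are from the public CLIP/open_clip code, while the rest is an illustrative sketch with random placeholder features:

```python
import numpy as np
import torch
import torch.nn as nn

# CLIP initializes a learnable log-temperature: exp(log(1/0.07)) ≈ 14.3
logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

# Training code (e.g. open_clip) clamps it so exp(.) <= 100; a larger scale
# inflates the logits, which is risky in FP16 (max representable ≈ 65504).
scale = logit_scale.clamp(max=np.log(100.0)).exp()

image_features = torch.randn(8, 512)  # placeholder features
text_features = torch.randn(8, 512)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# As in CLIP's forward(): scaled cosine similarities feed the softmax.
logits_per_image = scale * image_features @ text_features.t()
```

If the temperature learned during your pretraining drifts far from CLIP's, the logit magnitudes (and thus the FP16 behavior) will differ accordingly.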
