Hmm, that is not great. I've noticed that the per_device_train_batch_size in your TrainingArguments is commented out, so you're not actually training with a batch size of 1 (though I assume you did try that batch size before and commented it out once it didn't work).
Also, truncate_dim is just post-processing: the base model still runs at its full hidden size, probably 1024.
One option is to load the model itself in fp16 immediately:
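A minimal sketch, assuming a recent sentence-transformers version where `model_kwargs` is forwarded to the underlying `transformers` model:

```python
import torch
from sentence_transformers import SentenceTransformer

# Load the weights directly in fp16 instead of casting after the fact.
# trust_remote_code is needed for jina-clip-v2's custom modeling code.
model = SentenceTransformer(
    "jinaai/jina-clip-v2",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.float16},
)
```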
Beyond that, your script looks totally normal. I'm a bit surprised at the very high memory usage.
I did some more digging:
jina-clip seems to be loaded in bf16 automatically, whereas we train in fp16; that mismatch should have resulted in an immediate crash.
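You can verify the dtype the weights actually loaded in with a quick check:

```python
# torch.bfloat16 here, combined with fp16=True in the training
# arguments, would explain the precision mismatch.
print(next(model.parameters()).dtype)
```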
Flash Attention seems to be required, but you don't have triton installed, which flash-attn also requires (granted, maybe you installed it in a cell and then removed it), so you should have gotten an error there too.
Another common issue with notebooks is that old variables can be kept in memory even after rerunning a cell.
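If you suspect that, one option is to explicitly drop the old objects and empty the CUDA cache before re-creating them. A small sketch (`model` and `trainer` are just example variable names):

```python
import gc
import torch

# Delete whatever large objects the previous run created,
# then release the cached GPU memory they were holding.
del model, trainer
gc.collect()
torch.cuda.empty_cache()
```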
I'm struggling to get this to train well with fp16: lots of complaints about Half precision, or Half and Float tensors not matching, etc.
Having said that, I'm able to train in full precision by casting the model to fp32 and setting both fp16=False and bf16=False, with a batch size of 1, on a 15GB T4:
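For reference, a sketch of that setup with the Sentence Transformers v3 trainer; the dataset is a placeholder for your image-sentence pairs, and the loss is just an example:

```python
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("jinaai/jina-clip-v2", trust_remote_code=True)
model = model.float()  # cast all weights to fp32

args = SentenceTransformerTrainingArguments(
    output_dir="jina-clip-v2-ft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    fp16=False,  # no mixed precision at all
    bf16=False,
)

loss = MultipleNegativesRankingLoss(model)  # example loss; use your own

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # your image-sentence pair dataset
    loss=loss,
)
trainer.train()
```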
Hi guys.
I am trying to fine-tune the CLIP model jina-clip-v2 on my own image-sentence pairs.
I have an iterable dataset.
I am running on Colab with 40 GB of GPU RAM.
The loaded model occupies only 2 GB of GPU RAM.
When I start training with my iterable dataset, GPU memory usage exceeds 40 GB.
Things I tried:

- Truncating the model output to only 64-dimensional embeddings
- FP16 precision
- Batch size of 1
- Gradient accumulation of 4
Even so, I still cannot train.
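Roughly, the relevant part of my setup looks like this (simplified; the full notebook is linked below):

```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments

model = SentenceTransformer(
    "jinaai/jina-clip-v2",
    trust_remote_code=True,
    truncate_dim=64,  # keep only 64-dimensional embeddings
)

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    fp16=True,                      # FP16 precision
    per_device_train_batch_size=1,  # batch size of 1
    gradient_accumulation_steps=4,  # gradient accumulation of 4
)
```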
Can anyone help?
Here is the link to the Jupyter notebook on Colab:
https://colab.research.google.com/drive/1sBhTSNSsZtTOli89t4HV4-T5kDkMfegx?usp=sharing