Questions concerning configuring train_lora.py for custom corpus #130
Hi, thanks for your detailed post. I'll go through your issues one by one. Indeed, you don't need to add

As for the torch version, this depends a lot on the specific hardware you're using (including CUDA, cuDNN, OS, etc.), so I'm afraid there's not a lot I can do here. In our experiments, we used `torch==1.7.1`. In general, there's nothing wrong with using a newer version.

As for the questions:
2. Adding the extra embedding for newline feels like a hack, and I am pretty sure the original SaT-3l I used as the base model should have had this in the embedding. Was the wrong model used as the base? And is this new embedding for newline changing enough, given we freeze the weights of the original model? Furthermore, is this leaking into the `TokenClassifier` part of the model?
3. Is there a way to run fine-tuning on multiple GPUs?
4. Is the auxiliary objective function necessary for fine-tuning? I tried setting `"use_auxiliary": false` in the `config.json` file but ran into warnings while training and it errored while loading with `SaT`.
5. What is the difference between `one_sample_per_line=True` and `one_sample_per_line=False`? Does it actually make a difference on training results?
6. Is the optimal threshold returned by `compute_metrics` really the best threshold in your empirical experience?
7. Is corruption mandatory / desirable for validation? From my understanding, the default LoRA configs do not corrupt training samples, but `evaluate_sentence` has this line
I hope this is helpful!
Thank you for your detailed response! I have some responses and questions to your answers.
Yep! It all works out once the sigmoid is added.

> 1. Should we have newlines in our train / valid sentences?

I think it would be a good idea to make this explicit as a comment somewhere in `train_lora.py` or the README. This caused me a lot of confusion 😭

> 2. Adding the extra embedding for newline feels like a hack, and I am pretty sure the original SaT-3l I used as the base model should have had this in the embedding. Was the wrong model used as the base? And is this new embedding for newline changing enough, given we freeze the weights of the original model? Furthermore, is this leaking into the `TokenClassifier` part of the model?

This was my first instinct, and I initially had a dataset with no newlines (but still errored). I just tried removing them again and I receive the thousands of errors once again.

> 3. Is there a way to run fine-tuning on multiple GPUs?

Yeah, I don't think I need the extra devices, but it feels bad to leave compute power on the table :(

> 4. Is the auxiliary objective function necessary for fine-tuning? I tried setting `"use_auxiliary": false` in the `config.json` file but ran into warnings while training and it errored while loading with `SaT`.

Noted!

> 5. What is the difference between `one_sample_per_line=True` and `one_sample_per_line=False`? Does it actually make a difference on training results?

To clarify, my dataset is a bunch of documents that I would like to separate into sentences. Would `one_sample_per_line=True` be relevant here?

> 6. Is the optimal threshold returned by `compute_metrics` really the best threshold in your empirical experience?

Also noted, though I am curious how the auxiliary objective affects this, if at all.

> 7. Is corruption mandatory / desirable for validation? From my understanding, the default LoRA configs do not corrupt training samples, but `evaluate_sentence` has this line

Got it! Thanks again!
Thanks for the sigmoid/np.inf fix. I will adapt this later.

2. It is weird that you get these errors. Would it be possible for you to share (upload) your dataset (or a tiny subset) somewhere so I can try and debug it myself?

5. No, I suggest you keep it to
Unfortunately, I cannot share our dataset. I will try to reproduce the issue with a Wikipedia article. On a side note, how does SaT deal with newlines when it splits data with them?
I understand that - that would be very helpful - whatever does the job!
If I pass a document with newlines into SaT.split(), how does the model deal with the newlines?
I see. The tokenizer takes care of this. If you pass a newline, it will be tokenized in the same way as a space. But we only return character probabilities, so the newline is retained in the output and the text can be reconstructed.
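The reconstruction property described above can be sketched in isolation (hypothetical probabilities and threshold, not the actual wtpsplit internals): splitting on character-level probabilities keeps every character, newlines included, so concatenating the segments restores the input.

```python
# Hypothetical sketch: segment boundaries are chosen where the
# character probability exceeds a threshold; every character
# (newlines included) is kept, so the concatenation restores the input.
text = "First sentence.\nSecond sentence."
char_probs = [0.0] * len(text)
char_probs[14] = 0.9  # pretend the model marks a boundary after the "."

threshold = 0.5
segments, start = [], 0
for i, p in enumerate(char_probs):
    if p > threshold:
        segments.append(text[start:i + 1])
        start = i + 1
segments.append(text[start:])

assert "".join(segments) == text  # newline retained, text reconstructable
```

Note that the newline ends up attached to the start of the following segment rather than being dropped, which is what makes the output lossless.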
@eshau thank you for such a thorough investigation! I happened to encounter the same issues, and thanks to your findings, I was able to finally train LoRA :) @markus583 thanks for your great work! Regarding the issue with indexing errors, it occurs even on the dummy dataset that is mentioned on the README page:

```python
import torch

torch.save(
    {
        "language_code": {
            "sentence": {
                "dummy-dataset": {
                    "meta": {
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ],
                }
            }
        }
    },
    "dummy-dataset.pth",
)
```

Although the fix (
Hi! I followed the instructions for fine-tuning on my corpus and (I think) managed to do so successfully after days of debugging. I have A LOT of implementation questions, and the following is half guide, half questions about the process. I want to make it very clear that I am VERY thankful for the existence of this library, but I feel obligated to point out the issues below.
I first created a new conda environment (with `conda create -n sat-finetune python=3.9` and `conda install pip` just to be safe) and ran the following:

Then I created the `.pth` dataset as per this format:

My config is below:
The first issue that popped up was that `wtpsplit` wasn't installed. To fix this, I added the path of the `wtpsplit` dir to `train_lora.py`:

In hindsight, I think this can be fixed with a `pip install .` in the correct directory, but I wasn't sure.

After this, I received an outdated CUDA version error, as the torch version in the `requirements.txt` file by default installs `1.7.1`. I tried upgrading to the version on my kernel with the recommended torch command (`conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia`), but this updates numpy to 2.0 among other things and causes version errors. Downgrading to `torch=1.13.1+cu117` did not help (from a brief Google search, the version itself is buggy), and I progressively downgraded to `torch=1.10.1+cu111` to make it work.

This made CUDA work, but then I got an index error along these lines:
I believe this is because in `train_lora.py` we add the newline token to the tokenizer:

but never update the size of the embedding in the backbone model. This leads to the tokenizer generating the input id corresponding to the newline token (250002) while the embedding model is not big enough to accommodate it. I thought this was because I had newlines in my sentences, but even after removing them I still received this error (I also later realized we add newlines anyway in `prepare_dataset`). To fix this, I added this line:

This led to another error I will explain further below, but at this point I had a few questions:
1. Should we have newlines in our train / valid sentences?
2. Adding the extra embedding for newline feels like a hack, and I am pretty sure the original SaT-3l I used as the base model should have had this in the embedding. Was the wrong model used as the base? And is this new embedding for newline changing enough, given we freeze the weights of the original model? Furthermore, is this leaking into the `TokenClassifier` part of the model?
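The resize mismatch described above can be illustrated with a small numpy sketch (illustrative sizes; the real vocab grows from 250002 to 250003, and with a Hugging Face-style backbone this is what `resize_token_embeddings(len(tokenizer))` handles):

```python
import numpy as np

# Illustrative sketch: after the tokenizer grows by one token (the newline),
# the embedding matrix must also grow by one row, or the new token id will
# index out of bounds. Small sizes used here in place of 250002 -> 250003.
old_vocab, dim = 6, 4
emb = np.arange(old_vocab * dim, dtype=np.float32).reshape(old_vocab, dim)

new_row = np.zeros((1, dim), dtype=np.float32)  # fresh row for the new token
emb_resized = np.vstack([emb, new_row])

# Pretrained rows are untouched; the new id is now addressable.
assert emb_resized.shape == (old_vocab + 1, dim)
assert np.array_equal(emb_resized[:old_vocab], emb)
```

The key property is that the pretrained rows are preserved and only one freshly initialized row is appended for the added token.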
After this, I tried running `train_lora.py`, but the code was stuck. I did some debugging and it was stuck on this line:

I did some digging, and I think it was because I was running on 4 GPUs and the inter-GPU communication was not working. I tried several things (setting `ddp_backend=gloo` in the `config.json` file, setting `os.environ["NCCL_P2P_DISABLE"]="1"` at the top of `train_lora.py`), but the only thing that fixed it was restricting CUDA to one device by setting `os.environ["CUDA_VISIBLE_DEVICES"]="0"` at the top of `train_lora.py`. From my understanding of the paper and the GitHub issues I have read, the paper's experiments were run on TPUs (on Amazon SageMaker) and 1 GPU for fine-tuning, so this seems like an oversight. I feel like I have to ask the question here:
1. Is there a way to run fine-tuning on multiple GPUs?
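For reference, the single-GPU workaround mentioned above amounts to the following (the value `"0"` selects the first visible device; hypothetical minimal form, not the actual `train_lora.py` code):

```python
import os

# Restrict the process to a single GPU. This must run before torch
# initializes CUDA, i.e. before `import torch`; device ids are then
# remapped so the chosen GPU becomes cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import torch  # torch must be imported *after* the variable is set
```

Setting the variable after torch has already enumerated devices has no effect, which is why it has to sit at the very top of the script.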
After I fixed this, the code ran, but I received another error involving infinite values:

When I went through the traceback, I found the error to be in `evaluate.py`, specifically when `compute_metrics` in `train_lora.py` calls `evaluate_sentence`, which in turn calls `get_metrics` here:

This is because of this line in `get_metrics`:

because `preds` contains `-np.inf`. I did some more digging and found it was because we call `token_to_char_probs` here in `evaluate_sentence`:

This is because `token_to_char_probs` in `utils.__init__.py` initializes the return tensor `char_probs` as `-np.inf` here:

Because we only replace rows whose corresponding character is the last character of a non-special token:

we are left with a lot of `-np.inf` in the first column when we call `get_metrics`:

and `sklearn.metrics.auc` really does not like that. To fix this, I set the offending line in `get_metrics` to use `sigmoid(pred)`, as per the call to `f1score` in the same function:

With all of these changes, running the initial command with `train_lora.py` and `config.json` worked. The results look OK, but I have a few more questions at this point:

1. Is the auxiliary objective function necessary for fine-tuning? I tried setting `"use_auxiliary": false` in the `config.json` file but ran into warnings while training, and it errored while loading with `SaT`.
2. What is the difference between `one_sample_per_line=True` and `one_sample_per_line=False`? Does it actually make a difference on training results?
3. Is the optimal threshold returned by `compute_metrics` really the best threshold in your empirical experience?
4. Is corruption mandatory / desirable for validation? From my understanding, the default LoRA configs do not corrupt training samples, but `evaluate_sentence` has this line:

Once again, thank you for reading this, and I hope you can answer my questions / concerns.