
How to add a new decoder after gpt is created with ::new call ? #15

Open
cutoken opened this issue Jun 12, 2023 · 18 comments
Labels
enhancement New feature or request

Comments

@cutoken
Contributor

cutoken commented Jun 12, 2023

Hi @keyvank,
let's say I want to add a new decoder layer (the kind that gets constructed in the 0..num_layers loop) at run time, after the gpt::new() call. How do I go about it? As I understand it, the computations are pushed one by one with incrementing tensor ids, so adding a layer at a later point would also mean incrementing the ids of all following layers (for example, adding one more decoder layer along with all its sub-layers like attention means incrementing the vocab-out tensor and the other variables created outside the for loop?).

Also, why keep the computations in a BTreeMap when in reality it's used more like a Vec? We aren't even using the id against which each computation is stored (please correct me if I missed something :) )

@keyvank
Owner

keyvank commented Jun 12, 2023

Good question. I have done this manually, but it's tricky. You have to disallow loading the tensors that come after the last layer when loading into your new model; that way you can migrate your old weights into the new model.

Computations are kept in a BTreeMap because not all tensors are the result of a computation (e.g. input or parameter tensors), and we also need to process them in sorted order.
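
A rough illustration of that layout (hypothetical types, not the library's actual definitions): only some tensor ids have an associated computation, and iterating a BTreeMap visits them in id order even though the key space has gaps for input and parameter tensors.

```rust
use std::collections::BTreeMap;

// Hypothetical types for illustration only; the real library's definitions differ.
type TensorId = usize;

struct Computation {
    inputs: Vec<TensorId>,
    op: &'static str, // placeholder for the actual operation
}

struct Graph {
    // Not every tensor id appears here: input and parameter tensors
    // have no associated computation, so the key space has gaps.
    computations: BTreeMap<TensorId, Computation>,
}

impl Graph {
    fn forward(&self) {
        // BTreeMap iteration is ordered by key, so computations still run
        // in tensor-id order despite the gaps.
        for (id, comp) in &self.computations {
            println!("tensor {} = {}({:?})", id, comp.op, comp.inputs);
        }
    }
}

fn main() {
    let mut g = Graph { computations: BTreeMap::new() };
    // Tensor ids 0 and 1 are inputs/parameters (no computation); id 2 is computed.
    g.computations.insert(2, Computation { inputs: vec![0, 1], op: "matmul" });
    g.forward();
}
```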

@keyvank
Owner

keyvank commented Jun 12, 2023

@cutoken Please check the latest commit: 9d9db58

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Yes, saw that :)
So if I'm understanding this correctly, to add a new layer I stop the training, set optimizer to false, increase the number of layers, and restart the training?

@keyvank
Owner

keyvank commented Jun 12, 2023

@cutoken
Yes, and you can turn the optimizer on again after it has saved the training data. (But it will start from step 0, which is maybe not very efficient!)

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

It might start from step 0, but will the weights and biases of layer 1 and the other intermediate components from the previous training run stay as they are, or will they be reset to random?

@keyvank
Owner

keyvank commented Jun 12, 2023

New layers are random; old layers keep their old weights.
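
One way to picture that migration, as a sketch with hypothetical helper names rather than the library's API: copy a saved tensor into the new model only when its id exists in the old checkpoint and the shapes match, and leave everything belonging to the newly added layers at its random initialization.

```rust
use std::collections::HashMap;

// Hypothetical layout: parameter tensors keyed by id, flattened to Vec<f32>.
fn migrate_weights(
    old_checkpoint: &HashMap<usize, Vec<f32>>,
    new_params: &mut HashMap<usize, Vec<f32>>,
) {
    for (id, values) in new_params.iter_mut() {
        // Reuse a saved tensor only if it exists in the old checkpoint and the
        // lengths match; tensors belonging to the new layers keep their random init.
        if let Some(old) = old_checkpoint.get(id) {
            if old.len() == values.len() {
                values.copy_from_slice(old);
            }
        }
    }
}

fn main() {
    let old: HashMap<usize, Vec<f32>> = HashMap::from([(0, vec![0.5, 0.5])]);
    // Tensor 1 belongs to a newly added layer and stays random.
    let mut new: HashMap<usize, Vec<f32>> =
        HashMap::from([(0, vec![0.0, 0.0]), (1, vec![0.1, 0.2])]);
    migrate_weights(&old, &mut new);
    println!("{:?}", new);
}
```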

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Got it. Now one more ask along a similar angle: is there a way to not run the backward pass on a particular layer? Since backward passes are so costly, I want to exclude the earlier layers from training. Ideal would be some kind of layer numbers I can pass to the optimizer so that it just ignores the computations for those layers.

@keyvank
Owner

keyvank commented Jun 12, 2023

You can't do it for an arbitrary "particular layer", but you can run the backward pass only from the last layer back to a particular layer.
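
A minimal sketch of that idea, assuming computations are kept in a BTreeMap keyed by tensor id (hypothetical code, not the actual implementation): walk the map in reverse and stop after a fixed number of steps, so only the tail of the graph gets gradients.

```rust
use std::collections::BTreeMap;

type TensorId = usize;
struct Computation; // placeholder; the real type holds the op and its inputs

// Walk the computations in reverse id order and stop after `limit` of them,
// so gradients only flow through the newest part of the graph.
fn backward_partial(computations: &BTreeMap<TensorId, Computation>, limit: usize) {
    for (id, _comp) in computations.iter().rev().take(limit) {
        // ... accumulate the gradients of `_comp`'s inputs here ...
        println!("backprop through computation {}", id);
    }
}

fn main() {
    let graph: BTreeMap<TensorId, Computation> =
        (0..10).map(|id| (id, Computation)).collect();
    backward_partial(&graph, 3); // only the last 3 computations get gradients
}
```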

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Yeah, that is what I actually want. For example, say I started the training with 2 layers, trained them to death :D (loss not reducing anymore), and then restarted my training with 4 layers. Now I want the original layers 0 and 1 to get no backward pass, while the new layers 2 and 3 (and their sub-components like the self-attention heads) are trained normally. How do I achieve something like that?

@keyvank
Owner

keyvank commented Jun 12, 2023

@cutoken Pushed something that should be useful for you. But be aware, this might confuse the Adam optimizer (since the gradients of the other layers will be zero).
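
For context, here is a standard single-parameter Adam step (shown for reference only; not necessarily how this library implements it), which makes the concern visible: a parameter that keeps receiving zero gradients has its moment estimates stuck at, or decaying toward, zero while the step counter keeps advancing.

```rust
// Sketch of a standard single-parameter Adam step. With a gradient of exactly
// zero, the moment estimates m and v just decay geometrically toward zero while
// the step counter t keeps growing, which is why freezing layers can throw the
// optimizer state off.
fn adam_step(m: &mut f32, v: &mut f32, t: i32, grad: f32, lr: f32) -> f32 {
    let (b1, b2, eps) = (0.9_f32, 0.999_f32, 1e-8_f32);
    *m = b1 * *m + (1.0 - b1) * grad;
    *v = b2 * *v + (1.0 - b2) * grad * grad;
    let m_hat = *m / (1.0 - b1.powi(t));
    let v_hat = *v / (1.0 - b2.powi(t));
    -lr * m_hat / (v_hat.sqrt() + eps) // the weight update for this step
}

fn main() {
    let (mut m, mut v) = (0.0_f32, 0.0_f32);
    for t in 1..=3 {
        // A frozen parameter sees grad = 0.0 on every step, so its update stays at 0.
        let update = adam_step(&mut m, &mut v, t, 0.0, 1e-3);
        println!("step {t}: update = {update}");
    }
}
```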

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Saw your commit. But do the computations match the layers? 🤔 I mean, is the number of computations done in the backward pass always equal to the number of decoder layers? Only if it is would the commit work, I think @keyvank

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Okay, that zero-grad problem can be solved if we just clone the last layer. It's as good as any random value anyway :)

@keyvank
Owner

keyvank commented Jun 12, 2023

@cutoken No, it's not. You have to calculate how many computations are done from your new layers until the last computation.

@keyvank
Owner

keyvank commented Jun 12, 2023

I think this is the formula:

n = 3 + ((10 * num_heads) + 12) * num_new_layers

where head_size = embedding_degree / num_heads
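
A small helper encoding that formula as given here (treat it as an estimate from this thread, not something the library guarantees):

```rust
// Number of computations the backward pass must cover so that only the newest
// `num_new_layers` decoder layers receive gradients, per the formula above.
fn backward_computation_count(num_heads: usize, num_new_layers: usize) -> usize {
    3 + ((10 * num_heads) + 12) * num_new_layers
}

fn main() {
    // Example: 4 attention heads, 2 newly added layers -> 3 + 52 * 2 = 107.
    println!("{}", backward_computation_count(4, 2));
}
```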

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Got it. I'm thinking of just storing the layer number in the computation during creation. That way there's no need for any calculation and it would work reliably. So the call function would take the layer number and the layer type; if the layer type is decoder and the layer number is below the start layer/limit, we do nothing.
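
A purely hypothetical sketch of that proposal (the names are made up, and keyvank's reply below explains why this GPT-specific tagging may not belong in the general-purpose Graph):

```rust
// Hypothetical tagging of computations with GPT-level layer info; not the library's API.
#[derive(Clone, Copy, PartialEq)]
enum LayerKind {
    Decoder,
    Other,
}

struct TaggedComputation {
    kind: LayerKind,
    layer_index: usize,
}

// Skip the backward pass for decoder-layer computations below the trainable limit.
fn should_backprop(comp: &TaggedComputation, first_trainable_layer: usize) -> bool {
    !(comp.kind == LayerKind::Decoder && comp.layer_index < first_trainable_layer)
}

fn main() {
    let frozen = TaggedComputation { kind: LayerKind::Decoder, layer_index: 0 };
    let trained = TaggedComputation { kind: LayerKind::Decoder, layer_index: 2 };
    // Prints "false true": layer 0 is frozen, layer 2 is trained.
    println!("{} {}", should_backprop(&frozen, 2), should_backprop(&trained, 2));
}
```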

@keyvank
Owner

keyvank commented Jun 12, 2023

Hmmm, this could work, but I want this library to be general-purpose, and what you are proposing needs GPT logic added into the Graph logic. You can just forget about the calcs and use a number like 200; it doesn't need to be accurate hah

@keyvank keyvank added the enhancement New feature or request label Jun 12, 2023
@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Yes. This doesn't make much sense unless we expose it as functionality in the library. For now, if I get it working I'll keep it as an add-on in my branch. My guess is that fine-tuning would be faster with frozen layers, but I won't know unless I try the experiments :D I'll keep you posted on the results. If it is indeed faster, we can consider it for the main branch.

@cutoken
Contributor Author

cutoken commented Jun 14, 2023

Update on experimenting with this one:

  1. It does save time.
  2. The savings, however, aren't going to be high unless the model is deep.
  3. It has negligible impact on training quality; it will be even better once we remove the need for learning the embeddings (through sentencepiece) and positional encodings (through sine and cosine) at the input layer.
