
How to add a new decoder after gpt is created with ::new call ? #15

Open
cutoken opened this issue Jun 12, 2023 · 18 comments
Labels
enhancement New feature or request

Comments

@cutoken
Contributor

cutoken commented Jun 12, 2023

Hi @keyvank,
let's say I want to add a new decoder layer (the kind that gets constructed in the 0..num_layers loop) at run time, after the gpt::new() call. How do I go about it? As I understand it, the computations are pushed one by one with incrementing tensor ids, so adding a layer at a later point would also mean incrementing the ids of all following layers (for example, adding one more decoder layer along with all its sub-layers like attention means incrementing the vocab-out tensor and the other variables created outside the for loop?).

Also, why keep the computations in a BTreeMap when in reality it's used more like a Vec? We aren't even using the id against which each computation is stored (please correct me if I missed something :) )

@keyvank
Owner

keyvank commented Jun 12, 2023

Good question. I have done this manually, but it's tricky. You have to disallow loading the tensors that come after the last layer when loading into your new model; that way you can migrate your old weights into the new model.

Computations are kept in a BTreeMap because not all tensors are the result of a computation (e.g. input or parameter tensors), and we also need to process them in sorted order.
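
A rough illustration of that layout (hypothetical types, not the library's actual definitions): only some tensor ids have an associated computation, and iterating a BTreeMap visits them in id order even though the key space has gaps for input and parameter tensors.

```rust
use std::collections::BTreeMap;

// Hypothetical types for illustration only; the real library's definitions differ.
type TensorId = usize;

struct Computation {
    inputs: Vec<TensorId>,
    op: &'static str, // placeholder for the actual operation
}

struct Graph {
    // Not every tensor id appears here: input and parameter tensors
    // have no associated computation, so the key space has gaps.
    computations: BTreeMap<TensorId, Computation>,
}

impl Graph {
    fn forward(&self) {
        // BTreeMap iteration is ordered by key, so computations still run
        // in tensor-id order despite the gaps.
        for (id, comp) in &self.computations {
            println!("tensor {} = {}({:?})", id, comp.op, comp.inputs);
        }
    }
}

fn main() {
    let mut g = Graph { computations: BTreeMap::new() };
    // Tensor ids 0 and 1 are inputs/parameters (no computation); id 2 is computed.
    g.computations.insert(2, Computation { inputs: vec![0, 1], op: "matmul" });
    g.forward();
}
```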

@keyvank
Owner

keyvank commented Jun 12, 2023

@cutoken Please check the latest commit: 9d9db58

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Yes, saw that :)
So if I'm understanding this correctly, to add a new layer I stop the training, set optimizer to false, increase the number of layers, and restart the training?

@keyvank
Owner

keyvank commented Jun 12, 2023

@cutoken
Yes, and you can turn the optimizer on again after it has saved the training data. (But it will start from step 0, which is maybe not very efficient!)

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

It might start from step 0, but will the weights and biases of layer 1 and the other intermediate components from the previous training run stay as they are, or will they be reset to random?

@keyvank
Owner

keyvank commented Jun 12, 2023

New layers are random; old layers keep their old weights.
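
One way to picture that migration, as a sketch with hypothetical helper names rather than the library's API: copy a saved tensor into the new model only when its id exists in the old checkpoint and the shapes match, and leave everything belonging to the newly added layers at its random initialization.

```rust
use std::collections::HashMap;

// Hypothetical layout: parameter tensors keyed by id, flattened to Vec<f32>.
fn migrate_weights(
    old_checkpoint: &HashMap<usize, Vec<f32>>,
    new_params: &mut HashMap<usize, Vec<f32>>,
) {
    for (id, values) in new_params.iter_mut() {
        // Reuse a saved tensor only if it exists in the old checkpoint and the
        // lengths match; tensors belonging to the new layers keep their random init.
        if let Some(old) = old_checkpoint.get(id) {
            if old.len() == values.len() {
                values.copy_from_slice(old);
            }
        }
    }
}

fn main() {
    let old: HashMap<usize, Vec<f32>> = HashMap::from([(0, vec![0.5, 0.5])]);
    // Tensor 1 belongs to a newly added layer and stays random.
    let mut new: HashMap<usize, Vec<f32>> =
        HashMap::from([(0, vec![0.0, 0.0]), (1, vec![0.1, 0.2])]);
    migrate_weights(&old, &mut new);
    println!("{:?}", new);
}
```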

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Got it. Now one more ask along a similar angle: is there a way to not run the backward pass on a particular layer? Since backward passes are so costly, I want to exclude the earlier layers from training. Ideal would be some kind of layer numbers I can pass to the optimizer so that it just ignores the computations for those layers.

@keyvank
Owner

keyvank commented Jun 12, 2023

You can't do it for an arbitrary "particular layer", but you can run the backward pass only from the last layer back to a particular layer.
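
A minimal sketch of that idea, assuming computations are kept in a BTreeMap keyed by tensor id (hypothetical code, not the actual implementation): walk the map in reverse and stop after a fixed number of steps, so only the tail of the graph gets gradients.

```rust
use std::collections::BTreeMap;

type TensorId = usize;
struct Computation; // placeholder; the real type holds the op and its inputs

// Walk the computations in reverse id order and stop after `limit` of them,
// so gradients only flow through the newest part of the graph.
fn backward_partial(computations: &BTreeMap<TensorId, Computation>, limit: usize) {
    for (id, _comp) in computations.iter().rev().take(limit) {
        // ... accumulate the gradients of `_comp`'s inputs here ...
        println!("backprop through computation {}", id);
    }
}

fn main() {
    let graph: BTreeMap<TensorId, Computation> =
        (0..10).map(|id| (id, Computation)).collect();
    backward_partial(&graph, 3); // only the last 3 computations get gradients
}
```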

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Yeah, that is what I actually want. For example, say I started the training with 2 layers, trained them to death :D (loss not reducing anymore), and then restarted my training with 4 layers. Now I want the original layers 0 and 1 to get no backward pass, while the new layers 2 and 3 (and their sub-components like the self-attention heads) are trained normally. How do I achieve something like that?

@keyvank
Owner

keyvank commented Jun 12, 2023

@cutoken Pushed something that should be useful for you. But be aware, this might confuse the Adam optimizer (since the gradients of the other layers will be zero).
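
For context, here is a standard single-parameter Adam step (shown for reference only; not necessarily how this library implements it), which makes the concern visible: a parameter that keeps receiving zero gradients has its moment estimates stuck at, or decaying toward, zero while the step counter keeps advancing.

```rust
// Sketch of a standard single-parameter Adam step. With a gradient of exactly
// zero, the moment estimates m and v just decay geometrically toward zero while
// the step counter t keeps growing, which is why freezing layers can throw the
// optimizer state off.
fn adam_step(m: &mut f32, v: &mut f32, t: i32, grad: f32, lr: f32) -> f32 {
    let (b1, b2, eps) = (0.9_f32, 0.999_f32, 1e-8_f32);
    *m = b1 * *m + (1.0 - b1) * grad;
    *v = b2 * *v + (1.0 - b2) * grad * grad;
    let m_hat = *m / (1.0 - b1.powi(t));
    let v_hat = *v / (1.0 - b2.powi(t));
    -lr * m_hat / (v_hat.sqrt() + eps) // the weight update for this step
}

fn main() {
    let (mut m, mut v) = (0.0_f32, 0.0_f32);
    for t in 1..=3 {
        // A frozen parameter sees grad = 0.0 on every step, so its update stays at 0.
        let update = adam_step(&mut m, &mut v, t, 0.0, 1e-3);
        println!("step {t}: update = {update}");
    }
}
```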

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Saw your commit. But do the computations match the layers? 🤔 I mean, is the number of computations done in the backward pass always equal to the number of decoder layers? Only if it is would the commit work, I think @keyvank

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Okay, that zero-grad problem can be solved if we just clone the last layer. It's as good as any random value anyway :)

@keyvank
Owner

keyvank commented Jun 12, 2023

@cutoken No, it's not. You have to calculate how many computations are done from your new layers until the last computation.

@keyvank
Owner

keyvank commented Jun 12, 2023

I think this is the formula:

n = 3 + ((10 * num_heads) + 12) * num_new_layers

where head_size = embedding_degree / num_heads
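
A small helper encoding that formula as given here (treat it as an estimate from this thread, not something the library guarantees):

```rust
// Number of computations the backward pass must cover so that only the newest
// `num_new_layers` decoder layers receive gradients, per the formula above.
fn backward_computation_count(num_heads: usize, num_new_layers: usize) -> usize {
    3 + ((10 * num_heads) + 12) * num_new_layers
}

fn main() {
    // Example: 4 attention heads, 2 newly added layers -> 3 + 52 * 2 = 107.
    println!("{}", backward_computation_count(4, 2));
}
```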

@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Got it. I'm thinking of just storing the layer number in the computation during creation. That way there's no need for any calculation and it would work reliably. So the call function would take the layer number and the layer type; if the layer type is decoder and the layer number is below the start layer/limit, we do nothing.
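
A purely hypothetical sketch of that proposal (the names are made up, and keyvank's reply below explains why this GPT-specific tagging may not belong in the general-purpose Graph):

```rust
// Hypothetical tagging of computations with GPT-level layer info; not the library's API.
#[derive(Clone, Copy, PartialEq)]
enum LayerKind {
    Decoder,
    Other,
}

struct TaggedComputation {
    kind: LayerKind,
    layer_index: usize,
}

// Skip the backward pass for decoder-layer computations below the trainable limit.
fn should_backprop(comp: &TaggedComputation, first_trainable_layer: usize) -> bool {
    !(comp.kind == LayerKind::Decoder && comp.layer_index < first_trainable_layer)
}

fn main() {
    let frozen = TaggedComputation { kind: LayerKind::Decoder, layer_index: 0 };
    let trained = TaggedComputation { kind: LayerKind::Decoder, layer_index: 2 };
    // Prints "false true": layer 0 is frozen, layer 2 is trained.
    println!("{} {}", should_backprop(&frozen, 2), should_backprop(&trained, 2));
}
```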

@keyvank
Owner

keyvank commented Jun 12, 2023

Hmmm, this could work, but I want this library to be general-purpose, and what you are proposing needs GPT logic added into the Graph logic. You can just forget about the calcs and use a number like 200; it doesn't need to be accurate hah

@keyvank keyvank added the enhancement New feature or request label Jun 12, 2023
@cutoken
Contributor Author

cutoken commented Jun 12, 2023

Yes. This doesn't make much sense unless we expose it as functionality in the library. For now, if I get it working I'll keep it as an add-on in my branch. My guess is that fine-tuning would be faster with frozen layers, but I won't know unless I try the experiments :D I'll keep you posted on the results. If it is indeed faster, we can consider it for the main branch.

@cutoken
Contributor Author

cutoken commented Jun 14, 2023

Update on experimenting with this one:

  1. It does save time.
  2. The savings, however, aren't going to be high unless the model is deep.
  3. It has negligible impact on training quality; it will be even better once we remove the need for learning the embeddings (through sentencepiece) and positional encodings (through sine and cosine) at the input layer.
