Pretrained Model #94

nivibilla · 2023-04-17T22:32:57Z

Hi,

Are you planning on releasing a pretrained model anytime?

Thanks

spacewalkingninja · 2023-04-18T16:02:59Z

I also need the model. You must democratize AI, as not everybody has money to spend on GPUs. I can't train it, so please release a free SOTA model so that we can make TTS industrially.

nivibilla · 2023-04-18T17:18:43Z

To be fair, even if they do release it, I don't think it can be run on anything less than a 4090. And if you have a 4090, you can train it yourself.

nivibilla · 2023-04-18T20:14:45Z

Then again, we could try using bitsandbytes.

RuntimeRacer · 2023-04-26T01:43:40Z

I also need the model. You must democratize AI, as not everybody has money to spend on GPUs. I can't train it, so please release a free SOTA model so that we can make TTS industrially.

They 'must' do nothing honestly. Rather be thankful that this is implemented and released under an open source license already.

Also there is a model that has been shared here: #58 (comment)

nivibilla · 2023-04-26T06:27:26Z

True. But still, a full pretrained model would be nice.

RahulBhalley · 2023-04-30T09:55:13Z

I can try to train it. But I have some queries:

How much memory and time can it take to train the model?
Can I mix languages such as French, Spanish, and English from Common Voice?

nivibilla · 2023-04-30T10:25:28Z

@RahulBhalley The author of the repo said he trained it on a single GPU (so something like a 3090?). In terms of time, I'm not sure, but #58 said he trained it for 4 days on 8 A100s. The dataset for that was LibriTTS (600 hours), whereas the original Vall-E was trained on 60,000 hours. I'm not sure how @lifeiteng managed to reproduce the paper's results so quickly using only a single GPU. @lifeiteng, could you let us know how long it took?

In terms of mixing languages, I don't see why not, as long as the training data is processed into phonemes. But this may affect performance. I would suggest training it in English first so that you can reproduce the original results, then maybe finetuning it for other languages using LoRA.
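For a rough sense of why LoRA is attractive for that finetuning step, here is a minimal sketch of the parameter arithmetic. It assumes one hypothetical 1024 x 1024 projection matrix and a hypothetical rank of 8; neither number is taken from this repo's code.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple:
    """Return (full, lora) trainable parameter counts for one weight matrix.

    Full finetuning updates the whole d_out x d_in matrix W.
    LoRA freezes W and trains two small factors, B (d_out x rank) and
    A (rank x d_in), using W + B @ A as the effective weight.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# Hypothetical sizes, chosen for illustration only.
full, lora = lora_param_counts(d_in=1024, d_out=1024, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Because the trainable share shrinks as the rank falls, the gradient and optimizer-state memory for the adapted matrices shrinks with it, which is the main appeal on a single consumer GPU.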

RahulBhalley · 2023-04-30T12:04:38Z

Thanks for the suggestions!

@RahulBhalley The author of the repo said he trained it on a single GPU (so something like a 3090?). In terms of time, I'm not sure, but #58 said he trained it for 4 days on 8 A100s. I'm not sure how @lifeiteng managed to reproduce the paper's results so quickly using only a single GPU. @lifeiteng, could you let us know how long it took?

Maybe I'll try a 4090 or something. As for the A100, 80GB of VRAM seems like overkill. I'd rather try an A100 40GB or a 4090 (whichever turns out to be faster). It would be very helpful to know @lifeiteng's experience with training on a single GPU.

And the dataset for that was LibriTTS(600 hours), whereas the original Vall-E was trained on 60,000 hours.

Common Voice has 3,209 hours (2,429 hours validated) and 86,942 speakers! Although that is far fewer hours than LibriLight, the diversity of voices is 12.4x greater! I think it'll be able to do better unseen voice cloning.
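The comparison above can be checked with a few lines of arithmetic. The Common Voice and LibriLight hour figures are the ones quoted in this thread; the LibriLight speaker count of roughly 7,000 is an assumption that the thread does not state, chosen because it is consistent with the 12.4x figure.

```python
# Figures quoted in this thread.
common_voice_hours = 3_209
common_voice_speakers = 86_942
librilight_hours = 60_000

# Assumption: roughly 7,000 LibriLight speakers (not stated in the thread).
librilight_speakers = 7_000

speaker_ratio = common_voice_speakers / librilight_speakers
hours_ratio = librilight_hours / common_voice_hours

print(f"speakers: {speaker_ratio:.1f}x more in Common Voice")
print(f"hours:    {hours_ratio:.1f}x more in LibriLight")
```

So the trade-off is many more voices, each with much less audio on average.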

In terms of mixing languages, I don't see why not, as long as the training data is processed into phonemes. But this may affect performance. I would suggest training it in English first so that you can reproduce the original results, then maybe finetuning it for other languages using LoRA.

Sure. I don't know much about LoRA yet (got to know about it when DreamBooth-ing earlier). Still have to read the paper.

Then again, we could try using bitsandbytes.

This is gonna speed up the training by a lot! But looking at re-implementation of Adam as scaled Adam

vall-e/valle/modules/optim.py

Line 129 in 27c0667

class ScaledAdam(BatchedOptimizer):

I'm not sure how to use bnb.optim.AdamW8bit instead.

nivibilla · 2023-04-30T12:22:18Z

@RahulBhalley using bnb is super easy

import bitsandbytes as bnb
optim_g = bnb.optim.AdamW8bit(...)

You can use it as a drop-in replacement.

nivibilla · 2023-04-30T12:22:47Z

I used it for finetuning vits and it saved me almost 3gb of vram

https://github.com/nivibilla/efficient-vits-finetuning

RahulBhalley · 2023-04-30T12:30:51Z

Look at this statement.

vall-e/valle/bin/trainer.py

Line 148 in 27c0667

default="ScaledAdam",

It uses ScaledAdam (a custom implementation from scratch).

vall-e/valle/modules/optim.py

Line 129 in 27c0667

class ScaledAdam(BatchedOptimizer):

But ScaledAdam doesn't exist in bitsandbytes. I don't know whether VALL-E will converge as well with AdamW from bitsandbytes.

I think I'll go with AdamW from bitsandbytes.

nivibilla · 2023-04-30T12:41:02Z

I see. I mean, see how much VRAM you save. If it's only something like 3 GB, is it really worth it? The point of using 8-bit optimisers is mainly for finetuning, so we can fit a bigger batch size. If it's not causing OOM, maybe bitsandbytes isn't needed.
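To estimate the saving before trying it, here is a back-of-the-envelope sketch. It assumes an Adam-style optimizer that keeps two running statistics per parameter, stored either as 32-bit or as 8-bit values, and it ignores the small extra buffers an 8-bit optimizer needs for its quantization constants. The 400M parameter count is hypothetical, not the size of this model.

```python
def adam_state_bytes(num_params: int, bytes_per_value: int) -> int:
    """Adam-style optimizers keep two running statistics per parameter."""
    return 2 * num_params * bytes_per_value


def gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


num_params = 400_000_000  # hypothetical model size, for illustration only
fp32_state = adam_state_bytes(num_params, bytes_per_value=4)
int8_state = adam_state_bytes(num_params, bytes_per_value=1)

print(f"32-bit state: {gib(fp32_state):.2f} GiB")
print(f"8-bit state:  {gib(int8_state):.2f} GiB")
print(f"saved:        {gib(fp32_state - int8_state):.2f} GiB")
```

Under these assumptions the saving is a fixed 6 bytes per parameter. It frees room for a larger batch but does not grow with batch size, and nothing in this arithmetic implies a faster optimizer step.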

nivibilla · 2023-04-30T12:41:30Z

Have you had a look at the other Vall-E implementation? It uses Deepspeed.

nivibilla · 2023-04-30T12:42:38Z

Btw the original paper used AdamW

nivibilla · 2023-04-30T12:45:48Z

Also, for multiple languages, Vall-E X exists but has no implementation. Natural Speech 2 seems very promising, and an implementation by lucidrains is on the way.

RahulBhalley · 2023-04-30T12:50:04Z

I see. I mean, see how much VRAM you save. If it's only something like 3 GB, is it really worth it? The point of using 8-bit optimisers is mainly for finetuning, so we can fit a bigger batch size. If it's not causing OOM, maybe bitsandbytes isn't needed.

Okay, I thought it would speed up the training. 🤔 I should run the model in the fp16 data type instead.

Have you had a look at the other Vall-E implementation? It uses Deepspeed.

I don't mind the VRAM. Just wanted to speed up training.

Btw the original paper used AdamW

Cool. Will try that.

Also, for multiple languages, Vall-E X exists but has no implementation. Natural Speech 2 seems very promising, and an implementation by lucidrains is on the way.

Wow! The samples are incredible. Reminds me of NANSY++, where Jay-Z raps the lyrics of a Nas song.

nivibilla · 2023-04-30T12:53:13Z

Yeah Natural Speech 2 is amazing. I'm keeping a close eye on the implementation by lucidrains. It's being sponsored by Stability so if we're lucky he may provide a pretrained model

nivibilla · 2023-04-30T13:03:13Z

Btw if you want some more insight, there is a long thread here about mrq training vall e
https://git.ecker.tech/mrq/ai-voice-cloning/issues/152

Apparently on a 4070 ti lol

RahulBhalley · 2023-04-30T13:08:30Z

Haha, okay.

RuntimeRacer · 2023-05-01T17:33:18Z

In case anyone is interested in trying with CommonVoice also, I just created a PR for a CV dataset preparation script I created: #111

nivibilla · 2023-05-01T18:00:01Z

This is nice @RuntimeRacer . Do you know what the differences are from the paper?

RuntimeRacer · 2023-05-01T18:18:06Z

@nivibilla As far as I am aware, they only trained the model on LibriLight in the paper, which consists of 60k hours of pure English speech. I didn't do a breakdown of hours per language for the languages I downloaded, but you can check this here; it is using CommonVoice 13: https://commonvoice.mozilla.org/en/datasets

I was more curious to see whether the model is actually capable of learning speech-related dialects and applying them to different languages, for example when using a Japanese speaker + text as a prompt but letting it generate audio for an English sentence. I found https://github.com/serp-ai/bark-with-voice-clone to be capable of this; however, that model proved to be not very robust at this task (or at cloning voices from arbitrary samples in general).

nivibilla · 2023-05-01T18:35:36Z

@RuntimeRacer I see. Yeah, I would try to train it myself if I can. I'm thinking of maybe adapting the code to Deepspeed/accelerate so that I can do NVMe offloading. It will take a painstakingly long time and converge very slowly, but at least I can train.

RuntimeRacer · 2023-05-01T20:22:18Z

@nivibilla Yes having Accelerator for this repo would be awesome; I was considering looking into that myself; but since I am pretty busy these days I'm currently rather hoping that folks here will be able to fix the multi GPU issues faster than migrating the training code 😅
At least it seems someone is aware now after I highlighted there's still an issue: #86

RuntimeRacer · 2023-05-01T20:25:01Z

Tensorboard of my CV Training so far - It didn't even iterate through the first epoch yet:

nivibilla · 2023-05-01T20:27:58Z

@RuntimeRacer those are some interesting graphs. I wonder why it converges so quickly.

nivibilla · 2023-05-01T20:29:04Z

Train loss going down while valid loss goes up indicates overfitting, but surely the dataset is not that small.
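The pattern described here can be turned into a simple check over logged loss values. This is only a sketch: the function name, the window size, and the sample numbers are all made up for illustration and are not values from the run above.

```python
def diverging(train_losses, valid_losses, window=3):
    """Return True if, across the last `window` evaluations, the training
    loss fell while the validation loss rose."""
    if len(train_losses) < window or len(valid_losses) < window:
        return False  # not enough history to judge a trend
    train_delta = train_losses[-1] - train_losses[-window]
    valid_delta = valid_losses[-1] - valid_losses[-window]
    return train_delta < 0 and valid_delta > 0


# Made-up numbers for illustration only.
train = [3.9, 3.6, 3.4, 3.3, 3.2]
valid = [4.0, 3.8, 3.9, 4.1, 4.2]
print(diverging(train, valid))
```

A check like this only flags the symptom. Before a full epoch has been seen, a rising validation loss can also come from data the model has not encountered yet, so it is not proof of overfitting on its own.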

nivibilla · 2023-05-01T20:31:51Z

To be fair it's almost 300k steps. Did you try inference?

RuntimeRacer · 2023-05-01T20:34:27Z

That's my commandline used for training:

python3 bin/trainer.py --max-duration 60 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 --num-buckets 6 --dtype bfloat16 --save-every-n 5000 --valid-interval 5000 --model-name valle --share-embedding true --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 --base-lr 0.05 --warmup-steps 200 --average-period 0 --num-epochs 200 --start-epoch 1 --start-batch 160000 --accumulate-grad-steps 4 --world-size 1 --exp-dir exp/valle

Regarding overfitting, I had to change --max-duration from 80 to 60 at 160k steps because I hit an OOM error with the dataset loader. Not sure if that's the case here, but I have seen similar 'movements' in loss bias after changing dataset sizes and world sizes with other models I trained previously, so that might explain the upshift in loss mean at 160k.
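To see how much that change alters each optimizer update, here is a small sketch. It assumes that --max-duration caps the total seconds of audio in one batch, and that --accumulate-grad-steps accumulates that many batches per GPU before each update; the function name is hypothetical.

```python
def audio_seconds_per_update(max_duration, accumulate_grad_steps, world_size):
    """Approximate upper bound on the seconds of audio behind one optimizer update."""
    return max_duration * accumulate_grad_steps * world_size


# Values from the command line above, before and after the change at 160k steps.
before = audio_seconds_per_update(max_duration=80, accumulate_grad_steps=4, world_size=1)
after = audio_seconds_per_update(max_duration=60, accumulate_grad_steps=4, world_size=1)

print(before, after, after / before)
```

Under these assumptions each update sees about a quarter less audio after the change, which alters the gradient noise without changing the learning rate and could plausibly shift the loss curve.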

Regarding the valid loss jumps I assume the model still needs to generalize on new tokens or speaker characteristics because it didn't loop through a whole epoch yet. But that's just blind guessing from a dev with no theoretical background in data science.

RuntimeRacer · 2023-05-01T20:36:07Z

To be fair it's almost 300k steps. Did you try inference?

No, I understood that I need to train the AR model until I reach a definite minimum for valid loss, and then train the NAR model based on that checkpoint first. But I can try, of course, and see whether it gives us anything except static.

nivibilla · 2023-05-01T20:43:33Z

@RuntimeRacer thanks for looking into it. I will try to look at the code tomorrow and see if it's viable to migrate to accelerate. Deepspeed is probably not needed as you have enough GPUs.

RuntimeRacer · 2023-05-01T21:30:39Z

@nivibilla I just tried inference with one of my checkpoints; the output is full of static and hard to understand. But it generates the prompted sentences and also does the (expected) dialect transfer into different languages, with a deterministic speaker character for all prompts when using the same input audio and text reference.
However, you need quite some imagination at this point to believe it could be the person from the actual input who's trying to speak there.
I'll keep you updated.

Oh, and I also tried generating audio from text in Chinese and Japanese script. This makes the generation fill up VRAM and run into OOM. However, it works with a Latin-letter representation of Japanese or Chinese sentences.

nivibilla · 2023-05-01T21:38:32Z

@RuntimeRacer thanks for the update! I had a quick look at the code. It seems a bit hard, but it should be doable to implement accelerate. I can replace DDP with accelerate, but I'm not sure how that will affect saving and loading checkpoints and validation, as these happen on rank 0.

nivibilla · 2023-05-01T21:41:38Z

Also, there's a lot of code where he uses module. It may be easier to just use the other Vall-E implementation, which has Deepspeed.

RuntimeRacer · 2023-05-01T21:59:59Z

@nivibilla yes I actually found the other implementation first before I found this one; but I eventually went with this one here because it's still being actively maintained and I liked the fact that it just builds on top of lhotse / icefall (which I didn't know before actually), because this allows for a standardized setup with arbitrary ML datasets as long as they're part of lhotse. And I probably hoped the Multi-GPU Training to work properly. 😂

But yeah, maybe I'll give the other one a try too, even though it will require more work arranging the training data, I guess.

nivibilla · 2023-05-01T22:08:55Z

@RuntimeRacer yeah, the other author is AWOL. It would have been nice if the repo was built from the ground up in accelerate. I feel like I will break too many things while trying to implement accelerate. If the native multi-GPU works, then it's probably not worth the hassle. lifeiteng said in the issue you mentioned that it's being fixed, so hopefully soon.

nivibilla · 2023-05-01T22:24:34Z

@RuntimeRacer there is one more implementation (https://git.ecker.tech/mrq/vall-e). It's a fork of the enhuiz one, with some improvements like using bitsandbytes. It doesn't have icefall and lhotse, but it's better than the other one.

RuntimeRacer · 2023-05-01T22:33:53Z

Hmm, yeah, that seems like a better curated one. There's also a good hint in their docs to apply SGDR (https://markkhoffmann.medium.com/exploring-stochastic-gradient-descent-with-restarts-sgdr-fa206c38a74e) while training; I already had good experiences with that method when I trained Tacotron some time ago.
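For anyone unfamiliar with SGDR, here is a minimal standalone sketch of the learning-rate schedule it describes: cosine annealing with warm restarts. The function name, cycle length, and learning rates are hypothetical illustration values, not settings from either repo.

```python
import math


def sgdr_lr(step, lr_max, lr_min=0.0, cycle_len=1000, cycle_mult=2):
    """Cosine annealing with warm restarts (SGDR).

    Within a cycle the rate decays from lr_max to lr_min along a cosine,
    then jumps back to lr_max. Each cycle is cycle_mult times longer
    than the previous one.
    """
    t_cur, t_i = step, cycle_len
    while t_cur >= t_i:  # find the cycle this step falls in
        t_cur -= t_i
        t_i *= cycle_mult
    cosine = math.cos(math.pi * t_cur / t_i)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)


# The rate restarts at steps 1000 and 3000 with these settings.
for step in (0, 500, 999, 1000, 2000, 2999, 3000):
    print(step, round(sgdr_lr(step, lr_max=1e-3), 6))
```

In a PyTorch training loop the same schedule is available as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts, so a hand-written version like this is mainly useful for plotting or reasoning about where the restarts land.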

I might give it a try; just not sure when. Also I never heard of https://git.ecker.tech/ btw 😂

nivibilla · 2023-05-01T22:36:08Z

Yeah, mrq says he doesn't trust GitHub so he uses that instead. It's essentially a clone so idk what the difference is lol

RuntimeRacer · 2023-05-04T23:13:46Z

@nivibilla I just sat down for an hour and did this: #115
Didn't have a chance to test yet, since I want the current training Epoch running on one GPU to finish first (for the first time, after I stripped non-latin alphabet languages and it finally seems to have no errors anymore)
But maybe you want to check the code for potential issues in the meantime. :-)

nivibilla · 2023-05-05T06:35:44Z

Amazing. I still don't get why phonemizing is such a vram hog. But anyway. At least it works

RuntimeRacer · 2023-05-07T11:16:52Z

Accelerate was a much bigger task than I expected due to Lhotse limitations. However, I was able to fix DDP: #116

RuntimeRacer mentioned this issue May 2, 2023

Cuda OOM error when "saving batch" #110

Open
