How to use float8 for training? #2201

Open · vgoklani opened this issue Dec 23, 2024 · 7 comments

@vgoklani

Are there any examples for training the MLP blocks using float8 from torchao?

Thanks!

@calvinpelletier
Contributor

Hi @vgoklani, we don't currently support this, but you could modify a recipe to call torchao.float8.convert_to_float8_training on your model at the end of this function.
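
Roughly, a minimal sketch of that call (not a supported torchtune path; only_mlp_linears and model are placeholder names):

from torchao.float8 import convert_to_float8_training

# Placeholder filter: only convert linear layers whose fully qualified name
# contains "mlp"; adjust to match the model's actual module names.
def only_mlp_linears(module, fqn):
    return "mlp" in fqn

# "model" is whatever the recipe's model setup returns.
convert_to_float8_training(model, module_filter_fn=only_mlp_linears)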

However, I recommend using QLoRA, where frozen base model params are quantized to a lower precision (NF4), while the trainable adapter params are kept in a higher precision.

Here's an example config. The QLoRA model builder replaces the model's linear layers with LoRALinear(..., quantize_base=True) layers. If you want to use float8 instead of NF4, you can modify the LoRALinear class.
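
For illustration, a single QLoRA-style layer constructed directly (a sketch only; the dimensions are made up and the LoRALinear signature should be checked against your torchtune version):

from torchtune.modules.peft import LoRALinear

# Sketch: the frozen base weight is stored in NF4, while the trainable LoRA
# adapter stays in higher precision. Dimensions are illustrative only.
layer = LoRALinear(
    in_dim=4096,
    out_dim=4096,
    rank=8,
    alpha=16,
    dropout=0.0,
    quantize_base=True,  # quantize the frozen base weight to NF4
)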

Let me know if you have any questions!

@vgoklani
Author

Thanks @calvinpelletier. We are using the full-finetune scripts, and since the hardware already supports FP8, we are just leaving a lot of performance on the table... We can add it to our internal version, but I would imagine that there are other groups that want this included too.

@calvinpelletier
Contributor

We would definitely appreciate a PR if full-finetuning in FP8 works out well for you all!

@gau-nernst
Contributor

I was working on adding INT8 training to torchtune #1552, and FP8 was also part of that discussion. Once the INT8 PR is merged, we can make another one for FP8 too, since it follows a similar design.
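
For reference, the torchao prototype INT8 training API looks roughly like this (a sketch; these are prototype names and may change between torchao releases):

from torchao.quantization import quantize_
from torchao.prototype.quantized_training import int8_mixed_precision_training

# Sketch: swap linear layers for INT8 mixed-precision training variants.
# "model" is a placeholder for the model being finetuned.
quantize_(model, int8_mixed_precision_training())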

@vgoklani
Author

vgoklani commented Dec 24, 2024

Thank you @calvinpelletier and @gau-nernst

Using dynamic scaling with the torchao API was trivial, and gave a ~30% boost in tokens per second.

We're running on 4x NVIDIA A6000 Ada cards (SM89)

from torchao.float8 import (
    CastConfig,
    Float8LinearConfig,
    ScalingType,
    convert_to_float8_training,
)

config = Float8LinearConfig(
    enable_fsdp_float8_all_gather=True,
    force_recompute_fp8_weight_in_bwd=True,
    cast_config_input=CastConfig(scaling_type=ScalingType.DYNAMIC),
    cast_config_weight=CastConfig(scaling_type=ScalingType.DYNAMIC),
    cast_config_grad_output=CastConfig(scaling_type=ScalingType.DYNAMIC),
)

convert_to_float8_training(mlp, config=config)

Strangely enough, using DELAYED scaling crashed torch.compile... will need to dig into that further.
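
A minimal sketch of the convert-then-compile ordering with dynamic scaling (model stands in for the full finetune model):

import torch

# Sketch: convert to float8 first, then compile the converted model.
convert_to_float8_training(model, config=config)
model = torch.compile(model)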

@gau-nernst
Contributor

gau-nernst commented Dec 25, 2024

@vgoklani Delayed scaling is not as well supported as dynamic scaling, I think. It should be fine to stick with dynamic scaling.

Curious, do you observe any convergence issues?

@vgoklani
Author

vgoklani commented Dec 25, 2024

@gau-nernst The loss was very close to the bfloat16 loss! I'm looking forward to int8 training :)
