Hi, I’d like to commend you all on this fantastic project—it's truly impressive. I have a few questions and would appreciate any guidance:
Could you provide some details regarding the computational cost of training? Specifically, how much data was used, what type of GPUs were utilized, and how long the training process took?
When following the Accelerate Configuration Example, I ran into an issue while training on a 2× H100 setup. The error was: `RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16`.
To resolve this, I changed `dit.to(accelerator.device)` (line 108 in `train_flux_deepspeed_controlnet.py`) to `dit.to(accelerator.device, dtype=weight_dtype)`, after which training proceeded normally. I'm not entirely sure what caused the dtype mismatch; any insight into the root of the issue?
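For clarity, the change looks roughly like this (a sketch, not an exact patch; `weight_dtype` is whatever mixed-precision dtype the script already uses for the other models):

```python
# train_flux_deepspeed_controlnet.py, around line 108
# Original: the DiT is moved to the device without an explicit dtype, so one
# matmul operand ends up as fp16 (Half) while the other is bf16 (BFloat16).
# dit.to(accelerator.device)

# My change: cast the DiT to the same weight_dtype as the other models so the
# matmul operands agree.
dit.to(accelerator.device, dtype=weight_dtype)
```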
I'm training ControlNet on a small dataset of around 3,500 images. The loss stays in the 0.5-0.6 range even after 10k steps. Is this typical, or should I be concerned that something is off?
I really appreciate any help or advice you can offer. Thanks again for the amazing work you're doing!
Is there any new progress? I trained a pose ControlNet with 50,000 images, but at inference, even with strength set to 1, the control image has no guiding effect on the output. Can anyone help?
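For context, my understanding of how strength should behave is roughly the sketch below; the names are illustrative and not the actual x-flux code. The ControlNet produces per-block residuals that are scaled by strength and added to the DiT hidden states, so strength=1 should add them at full scale.

```python
from typing import Callable, Sequence
import torch

def run_dit_with_controlnet(
    blocks: Sequence[Callable[[torch.Tensor], torch.Tensor]],
    hidden: torch.Tensor,
    controlnet_residuals: Sequence[torch.Tensor],
    strength: float = 1.0,
) -> torch.Tensor:
    # Hypothetical sketch, not the project's actual API: each DiT block's
    # output gets the corresponding ControlNet residual added, scaled by
    # `strength`. strength=0 disables guidance; strength=1 adds the
    # residuals at full scale.
    for block, res in zip(blocks, controlnet_residuals):
        hidden = block(hidden)
        hidden = hidden + strength * res
    return hidden
```

If that is how it works here, my guess is that an output that stays unguided at strength=1 means the residuals themselves are close to zero (e.g. an undertrained ControlNet) rather than the strength being ignored, but I'd appreciate confirmation.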