Hi,
Thanks for your awesome work. I need some help fine-tuning Flux on my custom agriculture dataset. I have written a custom dataloader following the example guide, and when I try to train the ControlNet I get the following error:
/root/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py:443: UserWarning: `log_with=wandb` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
10/18/2024 19:05:41 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
DEVICE cuda:0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 6.24it/s]
Init model
Loading checkpoint
Init AE
743.80728 parameters
10/18/2024 19:05:59 - INFO - __main__ - ***** Running training *****
10/18/2024 19:05:59 - INFO - __main__ - Num Epochs = 15
10/18/2024 19:05:59 - INFO - __main__ - Instantaneous batch size per device = 1
10/18/2024 19:05:59 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 2
10/18/2024 19:05:59 - INFO - __main__ - Gradient Accumulation steps = 2
10/18/2024 19:05:59 - INFO - __main__ - Total optimization steps = 10000
Checkpoint 'latest' does not exist. Starting a new training run.
Steps: 0%|| 0/10000 [00:00<?, ?it/s]torch.Size([1]) torch.Size([1, 1024, 64]) torch.Size([1, 1024, 64])
Steps: 0%|| 0/10000 [00:03<?, ?it/s, lr=2e-5, step_loss=0.74]torch.Size([1]) torch.Size([1, 1024, 64]) torch.Size([1, 1024, 64])
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/iqbal/x-flux/train_flux_deepspeed_controlnet.py", line 317, in <module>
[rank0]: main()
[rank0]: File "/home/iqbal/x-flux/train_flux_deepspeed_controlnet.py", line 228, in main
[rank0]: block_res_samples = controlnet(
[rank0]: File "/root/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank0]: inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]: File "/root/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1528, in _pre_forward
[rank0]: if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
[rank0]: making sure all `forward` function outputs participate in calculating loss.
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameters which did not receive grad for rank 0: double_blocks.1.txt_mlp.2.bias, double_blocks.1.txt_mlp.2.weight, double_blocks.1.txt_mlp.0.bias, double_blocks.1.txt_mlp.0.weight, double_blocks.1.txt_attn.proj.bias, double_blocks.1.txt_attn.proj.weight
[rank0]: Parameter indices which did not receive grad for rank 0: 58 59 60 61 62 63
Steps: 0%|| 0/10000 [00:03<?, ?it/s, lr=2e-5, step_loss=0.74]
[rank0]:[W1018 19:06:03.954239992 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
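If I read the error message correctly, DDP is asking for `find_unused_parameters=True` on the wrapped module. Below is a minimal sketch of how that flag could be passed through accelerate when the Accelerator is created (I have not modified the training script; the surrounding arguments are only illustrative):

from accelerate import Accelerator, DistributedDataParallelKwargs

# Assumption: the training script builds the Accelerator itself, so the DDP flag
# suggested by the error can be forwarded via a kwargs handler.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(
    gradient_accumulation_steps=2,   # values here only mirror the run above
    mixed_precision="bf16",
    kwargs_handlers=[ddp_kwargs],
)

As far as I understand, this would only relax the DDP check; the listed double_blocks.1 text-branch parameters would still receive no gradients, so the question is why they are unused in the first place.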
Here is what my custom data loader looks like:
import os
import pandas as pd
import numpy as np
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader
from glob import glob
from os.path import join
import json
import random
import cv2


class Phenobench(Dataset):
    def __init__(self, img_dir, img_size=512):
        # Conditioning images (panoptic maps) and target RGB images.
        self.source_image_dir = join(img_dir, 'plants_panoptic_train')
        self.target_image_dir = join(img_dir, 'train', 'images')
        self.source_images = sorted(glob(self.source_image_dir + '/*.png'))
        self.target_images = sorted(glob(self.target_image_dir + '/*.png'))
        assert len(self.source_images) == len(self.target_images)
        self.img_size = img_size

    def __len__(self):
        return len(self.source_images)

    def __getitem__(self, idx):
        source_filename = self.source_images[idx]
        target_filename = self.target_images[idx]
        # The acquisition date in the filename determines the prompt.
        if '05-15' in target_filename:
            prompt = "sugarbeet crops and weed plants of different species with dark green colored leaves from early growth stages with sunny lighting conditions in the morning and dry darker brown soil background"
        elif '05-26' in target_filename:
            prompt = "sugarbeet crops and weed plants of different species with dark green colored leaves from early stages with sunny lighting conditions in the afternoon and dry lighter brown soil background"
        elif '06-05' in target_filename:
            prompt = "sugarbeet crops and weed plants of different species with dark green colored leaves from later growth stages with overcast weather conditions without shadows and dark brown soil background with a bit of moisture"
        else:
            prompt = "None"
        source = Image.open(source_filename)
        target = Image.open(target_filename)
        source = source.resize((self.img_size, self.img_size))
        target = target.resize((self.img_size, self.img_size))
        # Scale to [-1, 1] and convert to CHW tensors.
        source = torch.from_numpy((np.array(source) / 127.5) - 1)
        target = torch.from_numpy((np.array(target) / 127.5) - 1)
        source = source.permute(2, 0, 1)
        target = target.permute(2, 0, 1)
        return target, source, prompt


def loader(train_batch_size, num_workers, **args):
    dataset = Phenobench(**args)
    return DataLoader(dataset, batch_size=train_batch_size, num_workers=num_workers, shuffle=True)
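For reference, a call like the following builds the dataloader (the img_dir path is just a placeholder for wherever the PhenoBench data lives):

# Placeholder path; img_dir must contain 'plants_panoptic_train' and 'train/images'.
train_dataloader = loader(train_batch_size=1, num_workers=4,
                          img_dir="/data/PhenoBench", img_size=512)
target, source, prompt = next(iter(train_dataloader))
print(target.shape, source.shape, prompt)  # e.g. torch.Size([1, 3, 512, 512]) for both images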
Here is the error extracted from the logs: Parameters which did not receive grad for rank 0: double_blocks.1.txt_mlp.2.bias, double_blocks.1.txt_mlp.2.weight, double_blocks.1.txt_mlp.0.bias, double_blocks.1.txt_mlp.0.weight, double_blocks.1.txt_attn.proj.bias, double_blocks.1.txt_attn.proj.weight
I am trying to train it on an A100-80GB GPU. If needed, I can use multiple ones as well. I can also reduce the image size. I have tried running the script using both
When I try training the LoRA, the run also fails, but with a different error:
[rank0]: omegaconf.errors.ConfigAttributeError: Missing key double_blocks
[rank0]: full_key: double_blocks
[rank0]: object_type=dict
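From the traceback, this looks like OmegaConf raising on attribute access for a key that is not present in the loaded YAML. A minimal reproduction of that behaviour (the config contents here are made up, just to show the mechanism):

from omegaconf import OmegaConf
from omegaconf.errors import ConfigAttributeError

cfg = OmegaConf.create({"single_blocks": []})  # made-up config without a 'double_blocks' key
try:
    _ = cfg.double_blocks                      # attribute access on a missing key
except ConfigAttributeError as e:
    print(e)                                   # "Missing key double_blocks ..."

So presumably the LoRA config YAML I am using is missing a double_blocks entry that the training script expects, rather than this being a dataloader problem.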
Kindly help me narrow down what could be wrong. My conda environment was built using the same requirements.txt file you provided, and the models were downloaded with huggingface-cli using my token. Any help is appreciated.
Thank you.