
Missing Keys double_block error when fine-tuning ControlNet and LoRA #125

Open

niqbal996 opened this issue Oct 18, 2024 · 0 comments
Hi,
Thanks for your awesome work. I need some help fine-tuning Flux on my custom agriculture dataset. I have written a custom dataloader following the example guide (shown below). When I try to train the ControlNet, I get the following error:

/root/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py:443: UserWarning: `log_with=wandb` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
10/18/2024 19:05:41 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

DEVICE cuda:0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  6.24it/s]
Init model
Loading checkpoint
Init AE
743.80728 parameters
10/18/2024 19:05:59 - INFO - __main__ - ***** Running training *****
10/18/2024 19:05:59 - INFO - __main__ -   Num Epochs = 15
10/18/2024 19:05:59 - INFO - __main__ -   Instantaneous batch size per device = 1
10/18/2024 19:05:59 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
10/18/2024 19:05:59 - INFO - __main__ -   Gradient Accumulation steps = 2
10/18/2024 19:05:59 - INFO - __main__ -   Total optimization steps = 10000
Checkpoint 'latest' does not exist. Starting a new training run.
Steps:   0%|                                                                                                                                                           | 0/10000 [00:00<?, ?it/s]torch.Size([1]) torch.Size([1, 1024, 64]) torch.Size([1, 1024, 64])
Steps:   0%|                                                                                                                                  | 0/10000 [00:03<?, ?it/s, lr=2e-5, step_loss=0.74]torch.Size([1]) torch.Size([1, 1024, 64]) torch.Size([1, 1024, 64])
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/iqbal/x-flux/train_flux_deepspeed_controlnet.py", line 317, in <module>
[rank0]:     main()
[rank0]:   File "/home/iqbal/x-flux/train_flux_deepspeed_controlnet.py", line 228, in main
[rank0]:     block_res_samples = controlnet(
[rank0]:   File "/root/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank0]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]:   File "/root/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1528, in _pre_forward
[rank0]:     if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
[rank0]: making sure all `forward` function outputs participate in calculating loss. 
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameters which did not receive grad for rank 0: double_blocks.1.txt_mlp.2.bias, double_blocks.1.txt_mlp.2.weight, double_blocks.1.txt_mlp.0.bias, double_blocks.1.txt_mlp.0.weight, double_blocks.1.txt_attn.proj.bias, double_blocks.1.txt_attn.proj.weight
[rank0]: Parameter indices which did not receive grad for rank 0: 58 59 60 61 62 63
Steps:   0%|                                                                                                                                  | 0/10000 [00:03<?, ?it/s, lr=2e-5, step_loss=0.74]
[rank0]:[W1018 19:06:03.954239992 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
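
The traceback itself suggests passing `find_unused_parameters=True` to DistributedDataParallel. Since I launch through Accelerate, I assume that would be wired up roughly like the sketch below (based on the Accelerate docs, not on this repo's code; `ddp_kwargs` is just my naming and I have not verified this is the intended fix):

from accelerate import Accelerator, DistributedDataParallelKwargs

# Tell DDP to tolerate parameters that never receive gradients,
# e.g. the double_blocks.1.txt_* parameters listed in the error above.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(
    gradient_accumulation_steps=2,  # matches my training config
    mixed_precision="bf16",
    kwargs_handlers=[ddp_kwargs],
)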

Here is what my custom data loader looks like:

import numpy as np
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader
from glob import glob
from os.path import join


class Phenobench(Dataset):
    def __init__(self, img_dir, img_size=512):
        # Conditioning images (panoptic maps) and target photos are paired by sorted filename.
        self.source_image_dir = join(img_dir, 'plants_panoptic_train')
        self.target_image_dir = join(img_dir, 'train', 'images')
        self.source_images = sorted(glob(self.source_image_dir + '/*.png'))
        self.target_images = sorted(glob(self.target_image_dir + '/*.png'))
        assert len(self.source_images) == len(self.target_images)
        self.img_size = img_size

    def __len__(self):
        return len(self.source_images)

    def __getitem__(self, idx):
        source_filename = self.source_images[idx]
        target_filename = self.target_images[idx]

        # Pick a prompt based on the acquisition date encoded in the filename.
        if '05-15' in target_filename:
            prompt = "sugarbeet crops and weed plants of different species with dark green colored leaves from early growth stages with sunny lighting conditions in the morning and dry darker brown soil background"
        elif '05-26' in target_filename:
            prompt = "sugarbeet crops and weed plants of different species with dark green colored leaves from early stages with sunny lighting conditions in the afternoon and dry lighter brown soil background"
        elif '06-05' in target_filename:
            prompt = "sugarbeet crops and weed plants of different species with dark green colored leaves from later growth stages with overcast weather conditions without shadows and dark brown soil background with a bit of moisture"
        else:
            prompt = "None"

        source = Image.open(source_filename)
        target = Image.open(target_filename)

        source = source.resize((self.img_size, self.img_size))
        target = target.resize((self.img_size, self.img_size))

        # Scale pixel values from [0, 255] to [-1, 1] and reorder to CHW.
        source = torch.from_numpy((np.array(source) / 127.5) - 1)
        target = torch.from_numpy((np.array(target) / 127.5) - 1)

        source = source.permute(2, 0, 1)
        target = target.permute(2, 0, 1)
        return target, source, prompt


def loader(train_batch_size, num_workers, **args):
    dataset = Phenobench(**args)
    return DataLoader(dataset, batch_size=train_batch_size, num_workers=num_workers, shuffle=True)
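
For completeness, I construct the dataloader roughly like this (the directory path and worker count below are placeholders for the values in my YAML config):

# Placeholder arguments; the real ones come from my config file.
train_dataloader = loader(
    train_batch_size=1,
    num_workers=4,
    img_dir='/data/phenobench',
    img_size=512,
)
target, source, prompt = next(iter(train_dataloader))
# Both tensors should come out as something like torch.Size([1, 3, 512, 512]).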

Here is the error extracted from the logs:
Parameters which did not receive grad for rank 0: double_blocks.1.txt_mlp.2.bias, double_blocks.1.txt_mlp.2.weight, double_blocks.1.txt_mlp.0.bias, double_blocks.1.txt_mlp.0.weight, double_blocks.1.txt_attn.proj.bias, double_blocks.1.txt_attn.proj.weight

I am trying to train it on an A100-80GB GPU. If needed, I can use multiple GPUs, and I can also reduce the image size. I have tried running the script in both of these ways:

python3 train_flux_deepspeed_controlnet.py --config train_configs/pheno_panoptic.yaml # AND
accelerate launch train_flux_deepspeed_controlnet.py --config "train_configs/pheno_panoptic.yaml"
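
If multiple GPUs would actually help here, I assume the launch would look something like this (untested; the process count is just an example):

accelerate launch --multi_gpu --num_processes 2 train_flux_deepspeed_controlnet.py --config "train_configs/pheno_panoptic.yaml"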

When I try training the LoRA, I get a related error about double_blocks, but raised from a different place:
[rank0]: raise ConfigKeyError(f"Missing key {key!s}")
[rank0]: omegaconf.errors.ConfigAttributeError: Missing key double_blocks
[rank0]: full_key: double_blocks
[rank0]: object_type=dict

Kindly help me narrow down what could be wrong. My conda environment is built from the same requirements.txt file you provided, and the models were downloaded using huggingface-cli with my token. Any help is appreciated.

Thank you.
