
FSDP EKS Example failing with: module 'torch.library' has no attribute 'register_fake' #491

Open
nghtm opened this issue Nov 12, 2024 · 1 comment
nghtm commented Nov 12, 2024

Following the instructions in the HyperPod EKS workshop, running the FSDP EKS example on 2 p5 nodes fails with the following error from train.py:

```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsdp/train.py", line 281, in <module>
    main(args)
  File "/fsdp/train.py", line 168, in main
    model = AutoModelForCausalLM.from_config(model_config)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 439, in from_config
    model_class = _get_model_class(config, cls._model_mapping)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 388, in _get_model_class
    supported_models = model_mapping[type(config)]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 763, in __getitem__
    return self._load_attr_from_module(model_type, model_name)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 777, in _load_attr_from_module
    return getattribute_from_module(self._modules[module_name], attr)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module
    if hasattr(module, attr):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1766, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1780, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
module 'torch.library' has no attribute 'register_fake'
[2024-11-12 02:26:41,444] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1544) of binary: /usr/bin/python3
```

nghtm commented Nov 21, 2024

Suspect this is due to an issue in the underlying Docker container used in the FSDP example. Needs further investigation.

cc @sean-smith
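One way to narrow this down (a diagnostic sketch, not from the original thread): `torch.library.register_fake` was, as far as I know, added in PyTorch 2.4, so a container that pairs an older torch with a newer transformers release would produce exactly this AttributeError when transformers imports `modeling_llama`. Running something like the following inside the container shows whether the installed torch exposes the attribute:

```python
import importlib


def check_register_fake():
    """Return (torch_version, has_register_fake); (None, None) if torch is absent."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None, None
    return torch.__version__, hasattr(torch.library, "register_fake")


version, available = check_register_fake()
if version is None:
    print("torch is not installed in this environment")
elif available:
    print(f"torch {version} provides torch.library.register_fake")
else:
    # Assumption: upgrading torch (>= 2.4) or pinning transformers to a
    # release that supports the installed torch would resolve the mismatch.
    print(f"torch {version} lacks torch.library.register_fake; "
          "upgrade torch or pin transformers to an older release")
```

If the check reports the attribute is missing, the fix likely belongs in the example's Dockerfile (aligning the torch and transformers versions) rather than in train.py itself.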
