
FSDP EKS Example failing with: module 'torch.library' has no attribute 'register_fake' #491

Open
nghtm opened this issue Nov 12, 2024 · 1 comment
nghtm commented Nov 12, 2024

Following the instructions in the HyperPod EKS workshop, running the FSDP EKS example on 2 p5 nodes fails with the following error from train.py:

```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsdp/train.py", line 281, in <module>
    main(args)
  File "/fsdp/train.py", line 168, in main
    model = AutoModelForCausalLM.from_config(model_config)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 439, in from_config
    model_class = _get_model_class(config, cls._model_mapping)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 388, in _get_model_class
    supported_models = model_mapping[type(config)]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 763, in __getitem__
    return self._load_attr_from_module(model_type, model_name)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 777, in _load_attr_from_module
    return getattribute_from_module(self._modules[module_name], attr)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module
    if hasattr(module, attr):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1766, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1780, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
module 'torch.library' has no attribute 'register_fake'
[2024-11-12 02:26:41,444] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1544) of binary: /usr/bin/python3
```

nghtm commented Nov 21, 2024

Suspect this is due to an issue in the underlying Docker container used in the FSDP example. Needs further investigation.

cc @sean-smith
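One way to narrow this down (a diagnostic sketch, not from the original thread): `torch.library.register_fake` was, as far as I know, added in PyTorch 2.4, so a container that pairs an older torch with a newer transformers release would produce exactly this AttributeError when transformers imports `modeling_llama`. Running something like the following inside the container shows whether the installed torch exposes the attribute:

```python
import importlib


def check_register_fake():
    """Return (torch_version, has_register_fake); (None, None) if torch is absent."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None, None
    return torch.__version__, hasattr(torch.library, "register_fake")


version, available = check_register_fake()
if version is None:
    print("torch is not installed in this environment")
elif available:
    print(f"torch {version} provides torch.library.register_fake")
else:
    # Assumption: upgrading torch (>= 2.4) or pinning transformers to a
    # release that supports the installed torch would resolve the mismatch.
    print(f"torch {version} lacks torch.library.register_fake; "
          "upgrade torch or pin transformers to an older release")
```

If the check reports the attribute is missing, the fix likely belongs in the example's Dockerfile (aligning the torch and transformers versions) rather than in train.py itself.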
