Cannot install apex on a machine with CUDA 12.2 #1761

Open
momo1986 opened this issue Dec 21, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@momo1986

Describe the Bug

Minimal Steps/Code to Reproduce the Bug
running script:
"python setup.py install --cpp_ext --cuda_ext"

The reporting log:
"torch.version = 2.1.2+cu121

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
from /usr/bin

Traceback (most recent call last):
File "/home/hwq/ray/adversarial_examples/apex/setup.py", line 178, in
check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)
File "/home/hwq/ray/adversarial_examples/apex/setup.py", line 40, in check_cuda_torch_binary_vs_bare_metal
raise RuntimeError(
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.
In some cases, a minor-version mismatch will not cause later errors: #323 (comment). You can try commenting out this check (at your own risk)."

CUDA Version is 12.2.
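Note that the log above shows setup.py picking up the nvcc at /usr/bin, which is CUDA 11.5, while the PyTorch wheel was built for CUDA 12.1 and the driver reports 12.2, so the failing comparison is between the bare-metal nvcc and the wheel, not between the driver and PyTorch. Below is a minimal sketch of that same comparison, handy for checking which toolkit will actually be used before building; the CUDA_HOME default is an assumption, point it at your CUDA 12.x toolkit.

# Minimal sketch of the comparison apex's setup.py performs (not the upstream code).
# The default CUDA_HOME below is an assumption; adjust it to your CUDA 12.x toolkit.
import os
import re
import subprocess

import torch

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")  # assumed install path

def bare_metal_cuda_version(cuda_dir):
    # Parse "release X.Y" out of the output of <cuda_dir>/bin/nvcc -V
    out = subprocess.check_output([os.path.join(cuda_dir, "bin", "nvcc"), "-V"], text=True)
    match = re.search(r"release (\d+\.\d+)", out)
    return match.group(1) if match else "unknown"

print("PyTorch built with CUDA:", torch.version.cuda)            # e.g. 12.1
print("nvcc via CUDA_HOME:", bare_metal_cuda_version(cuda_home))
# If the major versions differ (11.5 vs 12.1 here), fix PATH/CUDA_HOME before
# building apex; a 12.2 vs 12.1 minor mismatch is usually harmless.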

Expected Behavior
Install apex successfully
Environment
uname -a
Linux ps 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi
Fri Dec 22 00:15:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |

@momo1986 momo1986 added the bug Something isn't working label Dec 21, 2023
@foreverpiano

same issue

@caseclose

caseclose commented Feb 3, 2024

Similar issue:

My GPU version is also CUDA 12.2. Installing apex directly results in the same error as mentioned above.

Then I switched to a conda virtual environment with CUDA version 11.3. My Torch version corresponds to CUDA 11.3, which is PyTorch 1.10. After that, using pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ installs apex successfully. However, when running the code, an error occurs:

Traceback (most recent call last):
  File ".../VALOR/./train.py", line 88, in <module>
    main(args)
  File ".../VALOR/./train.py", line 55, in main
    model = VALOR.from_pretrained(opts,checkpoint)
  File ".../VALOR/model/modeling.py", line 109, in from_pretrained
    model = cls(opts, *inputs, **kwargs)
  File ".../VALOR/model/pretrain.py", line 67, in __init__
    super().__init__(opts)
  File ".../VALOR/model/modeling.py", line 328, in __init__
    self.load_ast_model(base_cfg,config)
  File ".../VALOR/model/modeling.py", line 609, in load_ast_model
    self.audio_encoder = TransformerEncoder(model_cfg_audio, mode='prenorm')
  File ".../VALOR/model/transformer.py", line 149, in __init__
    layer = TransformerLayer(config, mode)
  File ".../VALOR/model/transformer.py", line 62, in __init__
    self.layernorm1 = LayerNorm(config.hidden_size, eps=1e-12)
  File ".../anaconda3/envs/valor1/lib/python3.9/site-packages/apex/normalization/fused_layer_norm.py", line 268, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File ".../anaconda3/envs/valor1/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

The README says we can use the '--cuda_ext' option to install fused_layer_norm_cuda, but that doesn't work here.
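A quick diagnostic for this failure mode is to try importing apex's compiled extensions directly and see which ones are actually present. The sketch below is just that check; the extension names other than fused_layer_norm_cuda (apex_C, amp_C) are assumptions and may vary between apex versions.

# Diagnostic sketch: were apex's compiled extensions actually built and installed?
# Names other than fused_layer_norm_cuda are assumptions and may differ per version.
import importlib

for ext in ("apex_C", "amp_C", "fused_layer_norm_cuda"):
    try:
        importlib.import_module(ext)
        print(ext, "-> OK")
    except ImportError as err:
        print(ext, "-> missing:", err)

If fused_layer_norm_cuda is missing, the --cuda_ext option most likely never reached setup.py, or the extension failed to compile; re-running the pip command with -v and searching the log for fused_layer_norm_cuda usually shows which.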

@Tsuki0125

same issue:
File "", line 994, in _gcd_import
File "", line 971, in _find_and_load
File "", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

@Zhangwq76

I think you can remove the check code in setup.py, then use
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

@adafok

adafok commented May 25, 2024

I've encountered the same issue. @Zhangwq76, could you tell us which part of the check code we should remove?

@Zhangwq76

I've encountered the same issue. @Zhangwq76, could you tell us which part of the check code we should remove?

In check_cuda_torch_binary_vs_bare_metal (around line 39 of setup.py), comment out the version check:

# if (bare_metal_version != torch_binary_version):
#     raise RuntimeError(
#         "Cuda extensions are being compiled with a version of Cuda that does "
#         "not match the version used to compile Pytorch binaries. "
#         "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
#         + "In some cases, a minor-version mismatch will not cause later errors: "
#         "#323 (comment). "
#         "You can try commenting out this check (at your own risk)."
#     )
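An alternative to deleting the check outright is to relax it so it only fails on a major-version mismatch and merely warns on a minor one such as 12.2 vs 12.1. The sketch below is not apex's upstream function and uses a simplified signature, but it captures that idea.

# Sketch of a relaxed version check: error on a CUDA major-version mismatch,
# warn on a minor one (e.g. bare-metal 12.2 vs the cu121 PyTorch wheel).
# Simplified signature compared to apex's check_cuda_torch_binary_vs_bare_metal.
import warnings
from packaging.version import parse

def relaxed_cuda_version_check(bare_metal_str, torch_cuda_str):
    bare, built = parse(bare_metal_str), parse(torch_cuda_str)
    if bare.major != built.major:
        raise RuntimeError(
            f"CUDA major version mismatch: nvcc is {bare} but PyTorch was built with {built}."
        )
    if bare.minor != built.minor:
        warnings.warn(
            f"Minor CUDA version mismatch ({bare} vs {built}); this usually still builds fine."
        )

# e.g. relaxed_cuda_version_check("12.2", "12.1") only warns; ("11.5", "12.1") raises.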

@yachty66

+1

@googio

googio commented Aug 18, 2024

I've got a quick fix for this at https://github.com/googio/apex, based on @Zhangwq76's solution:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/googio/apex

@AEProgrammer

same issue:
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

I've met the same issue; did you solve it? I'm using CUDA 12.2 with torch 2.1. I modified the version check code in setup.py and ran
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
to install apex. The install succeeded, but when I run Megatron I get an error that 'fused_layer_norm_cuda' is not found.
