Hard error on mismatch between torch.version.cuda and the Cuda toolkit version being used to compile Apex #323

Merged
mcarilli merged 1 commit into master from error_mismatch on May 22, 2019

Conversation

@mcarilli (Contributor) commented May 22, 2019

The warning message was too subtle and too easy to overlook in the output of setup.py; it really should be a hard error.

Making it a hard error should also assist with cuda driver version errors like #314. #314 resulted because the cuda driver (libcuda.so) version was 10.0, the cuda toolkit version used to compile the Pytorch binaries was 10.0 (which was fine), but the cuda toolkit version used to compile Apex was 10.1** (which triggered a PTX JIT compilation error at runtime because the 10.0 libcuda.so couldn't handle the PTX produced by the 10.1 nvcc). The PTX JIT compilation error message was cryptic and unhelpful.

However, if the toolkit version that was used to compile Pytorch binaries is too recent for the system's cuda driver version, Pytorch will raise a much more helpful error, something like

"AssertionError: 
The NVIDIA driver on your system is too old (found version 10000)..."

If we hard-enforce that the cuda toolkit version used to compile Apex == the cuda toolkit version used to compile Pytorch, we also ensure that if the toolkit version used to compile Apex is too new for the driver, the toolkit version used to compile Pytorch must also be too new for the driver, and therefore in such cases we will receive the helpful Pytorch error instead of the bizarre PTX JIT error.

**A warning about the mismatch between torch.version.cuda and the toolkit (nvcc) had likely been issued by setup.py while compiling Apex, but it had likely been overlooked, so what ended up surfacing was the PTX JIT error, which was not at all a clear indication of what had gone wrong.
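For reference, here is a minimal sketch, in the spirit of the setup.py check shown in the diff further down, of the comparison this PR turns from a warning into a hard error. The nvcc-output parsing details here are illustrative, not the exact implementation:

import subprocess
import torch

def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
    # Ask the local nvcc (the "bare metal" toolkit) which release it belongs to.
    raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"],
                                         universal_newlines=True)
    # nvcc prints e.g. "Cuda compilation tools, release 10.1, V10.1.105"
    release_line = [line for line in raw_output.split("\n") if "release" in line][0]
    bare_metal_version = release_line.split("release ")[1].split(",")[0]
    bare_metal_major, bare_metal_minor = bare_metal_version.split(".")[:2]

    # torch.version.cuda reports the toolkit the installed Pytorch binary was built with.
    torch_binary_major, torch_binary_minor = torch.version.cuda.split(".")[:2]

    if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
        raise RuntimeError(
            "Cuda extensions are being compiled with a version of Cuda that does "
            "not match the version used to compile Pytorch binaries.  "
            "Pytorch binaries were compiled with Cuda {}.".format(torch.version.cuda))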

@mcarilli changed the title from "Hard error on Pytorch Cuda + Cuda toolkit version mismatch" to "Hard error on mismatch between torch.version.cuda and the Cuda toolkit version being used to compile Apex" May 22, 2019
@mcarilli merged commit 50689f6 into master May 22, 2019
@mcarilli deleted the error_mismatch branch May 22, 2019 23:34
@@ -47,10 +47,9 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
     print(raw_output + "from " + cuda_dir + "/bin\n")

     if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):


Pytorch supports Cuda 9.1. Not sure that minor versions must also be compared.

@mcarilli (Contributor, Author):

True, and 9.2 as well. Nvidia also ships Docker containers where we build Pytorch from source with cuda 10.1. In practice, I do find that sometimes minor version mismatches can cause errors.


In my case it did not. I had to edit the file and remove this check.

@mcarilli (Contributor, Author):

I think it's good to keep this check for safety. I will note what you said in the error message, telling people they can comment it out if necessary, but at their own risk.


Maybe the check condition (I mean the if) could use other, more specific criteria rather than just the CUDA version.

@stas00 (Contributor) commented Oct 17, 2022:

This is outdated, here is an updated diff:

diff --git a/setup.py b/setup.py
index e3063be..ed88abd 100644
--- a/setup.py
+++ b/setup.py
@@ -30,6 +30,11 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
     print(raw_output + "from " + cuda_dir + "/bin\n")

     if (bare_metal_version != torch_binary_version):
+
+        # allow minor version mismatch
+        if bare_metal_version.major == torch_binary_version.major and bare_metal_version.minor != torch_binary_version.minor:
+            return
+
         raise RuntimeError(
             "Cuda extensions are being compiled with a version of Cuda that does "
             "not match the version used to compile Pytorch binaries.  "

@ptrblck, perhaps this could be added to setup.py in a commented-out fashion so it's trivial for the user to activate? This is a safer approach than an outright return, since it still checks that the major versions match.

I have this patch pasted all over, since almost all projects I'm involved in have this mismatch and it works just fine. Most of the time we have no control over the environment provided by the system (e.g. HPC), so the mismatch is not by choice.

@stas00 (Contributor) commented Oct 17, 2022:

Or even better, let's have an env var that activates this exception, so that the assert message could say: run

CUDA_MINOR_VERSION_MISMATCH_OK=1 python setup.py ...

so that the user doesn't have to change the source code at all. I'd be happy to submit a PR if this resonates.
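A hypothetical sketch of what that gate could look like; the variable name CUDA_MINOR_VERSION_MISMATCH_OK is only the one suggested above, not something setup.py currently supports:

import os
from packaging.version import parse

def check_versions(bare_metal_version_str, torch_binary_version_str):
    bare_metal_version = parse(bare_metal_version_str)
    torch_binary_version = parse(torch_binary_version_str)
    if bare_metal_version == torch_binary_version:
        return
    # Allow a minor-version mismatch only when the user explicitly opts in.
    same_major = bare_metal_version.major == torch_binary_version.major
    if same_major and os.environ.get("CUDA_MINOR_VERSION_MISMATCH_OK") == "1":
        print("Allowing CUDA minor-version mismatch: {} vs {}".format(
            bare_metal_version, torch_binary_version))
        return
    raise RuntimeError(
        "CUDA version mismatch: nvcc reports {} but Pytorch binaries were built "
        "with {}. Rerun with CUDA_MINOR_VERSION_MISMATCH_OK=1 python setup.py ... "
        "to proceed at your own risk.".format(bare_metal_version, torch_binary_version))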


I installed CUDA 10.2 and PyTorch 1.7.1 in my conda virtual env, but the CUDA compilation tools (nvcc) report 11.1, and it didn't work for me.


torch.__version__ = 2.3.0+cu121

/home/E/apex/setup.py:111: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
from /usr/local/cuda/bin

Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/home/pzeng/RAG/SimCSE/apex/setup.py", line 179, in
check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)
File "/home/E/apex/setup.py", line 36, in check_cuda_torch_binary_vs_bare_metal
raise RuntimeError(
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.
In some cases, a minor-version mismatch will not cause later errors: #323 (comment). You can try commenting out this check (at your own risk).

Why do I get a version mismatch error even though the two major versions are the same? In the end I just commented out the code that the error message points to.


I was facing the same issue. I had installed the PyTorch version with CUDA 12.1 support and was using CUDA 12.0 to build apex. They don't work together even though it is only a minor-version mismatch; after commenting out the RuntimeError, I ran into a segmentation fault at a later compilation stage (the cicc core gets dumped). I resolved the issue by reinstalling CUDA 12.1 to match my torch version.

@Yorkking commented Aug 5, 2019

Hi @mcarilli, I am hitting this problem too.
My nvcc version is CUDA 10.1, but the latest PyTorch build available from the website uses 10.0, so I downloaded that. Now this problem occurs on Windows. Could you help me?

RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  
Pytorch binaries were compiled with Cuda 10.0.
In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. 
 You can try commenting out this check (at your own risk).

Sorry, I just tried deleting the code that detects the version mismatch, and it seems to work!

@ptrblck (Contributor) commented Aug 5, 2019

@Yorkking This error is just raised as an additional check, as the version mismatch could raise some errors. If you don't encounter any issues, you could stick to your current setup.
However, the recommended way would be to install a matching CUDA version.

@Yorkking commented Aug 7, 2019

OK, I got it. Thanks very much! @ptrblck

@JoyHuYY1412

@Yorkking This error is just raised as an additional check, as the version mismatch could raise some errors. If you don't encounter any issues, you could stick to your current setup.
However, the recommended way would be to install a matching CUDA version.

Hi, I have CUDA 8.0 and 9.0 on my machine, but the cuda symlink points to cuda-8.0. How can I use the cuda-9.0 path to install apex? Thanks.

@ptrblck (Contributor) commented Aug 21, 2019

Could you try to export the cuda folder corresponding to the version you would like to use via e.g.

export CUDA_HOME=/usr/local/cuda-9.0

@Zacharias030 commented Sep 17, 2019

Could you try to export the cuda folder corresponding to the version you would like to use via e.g.

export CUDA_HOME=/usr/local/cuda-9.0

This was very helpful, thank you!

Why does the apex installation script not look for a matching version of CUDA and gcc by itself? It was smart enough to send me here (and to require gcc-7 as gcc, which another symlink solved).

@mcarilli (Contributor, Author)

A mismatch between cuda and pytorch versions is a potential issue that's not specific to Apex. The underlying Pytorch extension builder is what's responsible for finding Cuda. In general, it's hard to know where to look on an arbitrary system: the cuda runtime is just a bunch of binaries and shared objects, so it could be present literally anywhere.
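As a quick diagnostic (not something Apex runs itself), you can print where the extension builder located a toolkit and compare it against the toolkit the Pytorch binary was built with:

import subprocess
import torch
from torch.utils.cpp_extension import CUDA_HOME

# Toolkit the installed Pytorch binary was compiled with.
print("torch.version.cuda =", torch.version.cuda)
# Toolkit the extension builder found (honors the CUDA_HOME env var if set,
# otherwise falls back to searching the system, e.g. /usr/local/cuda).
print("CUDA_HOME =", CUDA_HOME)

if CUDA_HOME is not None:
    # Same nvcc query that setup.py performs; prints "release X.Y, VX.Y.Z".
    print(subprocess.check_output([CUDA_HOME + "/bin/nvcc", "-V"],
                                  universal_newlines=True))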

@zhuhui1214
do not use "python setup.py install --cuda_ext --cpp_ext"
use "python setup.py install" instead

@ptrblck (Contributor) commented Dec 23, 2019

@zhuhui1214 This will not install the CUDA extensions and thus will run slower.
While this is a workaround for this issue, it's not a solution and you should e.g. make sure your local CUDA version matches the one shipped with PyTorch.

@GraphGrailAi
I am still confused about how to solve this (I read the full thread): I am not able to install Apex with full functionality (the Python-only install works fine), both on my PC and on Google Colab (CUDA 10.1).
The error on the PC when I try to run my code is:
No module named 'fused_layer_norm_cuda'

@cosen1024
do not use "python setup.py install --cuda_ext --cpp_ext"
use "python setup.py install" instead

very useful!

@leungi mentioned this pull request May 22, 2020
@sofzh commented Jun 3, 2020

I am hitting this error too. My CUDA version is 10.0.130 and my PyTorch version is 1.0.0.
The error is: "RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 9.0.176."

@ptrblck (Contributor) commented Jun 5, 2020

@sofzh
We recommend using the native amp implementation as described here: #818

Alternatively, you would either have to install the matching CUDA version locally (9.0), build PyTorch from source using your current CUDA installation, or install a PyTorch binary with a matching CUDA version.
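Before picking one of those options, a small pre-flight check can confirm which situation you are in (this assumes nvcc is on the PATH and a CUDA-enabled PyTorch build is installed; it is a convenience sketch, not part of the apex build):

import re
import subprocess
import torch

nvcc_output = subprocess.check_output(["nvcc", "-V"], universal_newlines=True)
bare_metal = re.search(r"release (\d+\.\d+)", nvcc_output).group(1)   # e.g. "10.0"
torch_cuda = ".".join(torch.version.cuda.split(".")[:2])              # e.g. "9.0"

if bare_metal == torch_cuda:
    print("Local toolkit ({}) matches the Pytorch binaries; apex CUDA extensions "
          "should build.".format(torch_cuda))
else:
    print("Mismatch: nvcc is {}, Pytorch was built with {}. Install CUDA {} locally, "
          "rebuild Pytorch from source, or install a Pytorch binary built with CUDA {}."
          .format(bare_metal, torch_cuda, torch_cuda, bare_metal))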

imsky added a commit to MeetElise/apex that referenced this pull request Jun 19, 2020
@griff4692
My nvcc version is CUDA 11.0, but the latest PyTorch version from the website is built with 10.2.

As a result I can't properly install apex. Is it safe to comment out the exception, and where in the code do I comment it out?

@gghati commented Jul 1, 2020

Commenting it out, as said in #323 (comment), worked for me too:

# if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
#     raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
#                        "not match the version used to compile Pytorch binaries.  " +
#                        "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +
#                        "In some cases, a minor-version mismatch will not cause later errors:  " +
#                        "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
#                        "You can try commenting out this check (at your own risk).")

Just comment out this "if" block in setup.py.

I have CUDA 10.1 and torch 1.5.1 and I don't face any problems as such.

@MuruganR96
My nvcc version is CUDA 11.0, but the latest PyTorch version from the website is built with 10.2.

As a result I can't properly install apex. Is it safe to comment out the exception, and where in the code do I comment it out?

@griff4692 Have you resolved the issue? I am facing the same problem.

Software Versions pre-installed:

Nvidia Driver: 450.51v
CUDA: 11v
cuDNN: 8.0v
Python: 3.8
Docker: 19.03.12v
Nvidia-docker: 2.0v
NGC(Nvidia GPU Cloud) CLI: 1.15.0v
Traceback (most recent call last):
  File "train.py", line 188, in <module>
    train(num_gpus, args.rank, args.group_name, **train_config)
  File "train.py", line 83, in train
    from apex import amp
ImportError: cannot import name 'amp'
pip uninstall apex
cd apex
rm -rf build 
python setup.py install --cuda_ext --cpp_ext 
Traceback (most recent call last):
  File "setup.py", line 152, in <module>
    check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
  File "setup.py", line 106, in check_cuda_torch_binary_vs_bare_metal
    "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 9.0.176.
In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).

Please help me to solve this issue @ptrblck sir. :)

@potipot commented Jan 24, 2021

To proceed with the installation you can edit the symbolic link under /usr/local. In my case I already had different CUDA versions installed, so I just changed the symbolic link from

cuda -> cuda-10.0/

to

cuda -> cuda-10.2/

and the installation proceeded. I guess you can achieve the same by setting the appropriate environment variable.

@vadimkantorov commented Feb 15, 2021

Is there a way to disable this check without modifying the code?

My current installation command is pip install cxxfilt && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex (my PyTorch is compiled with CUDA10.2, and machine has CUDA11.2)



    torch.__version__  = 1.7.1


    /tmp/pip-req-build-pd7maz9h/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
      warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

    Compiling cuda extensions with
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2020 NVIDIA Corporation
    Built on Mon_Nov_30_19:08:53_PST_2020
    Cuda compilation tools, release 11.2, V11.2.67
    Build cuda_11.2.r11.2/compiler.29373293_0
    from /usr/local/cuda/bin

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-pd7maz9h/setup.py", line 171, in <module>
        check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
      File "/tmp/pip-req-build-pd7maz9h/setup.py", line 102, in check_cuda_torch_binary_vs_bare_metal
        raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
    RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 10.2.
    In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).
    Running setup.py install for apex ... error

@vvuonghn commented Jul 7, 2021

In my case, I commented out these lines:

if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
    raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
                       "not match the version used to compile Pytorch binaries.  " +
                       "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +
                       "In some cases, a minor-version mismatch will not cause later errors:  " +
                       "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
                       "You can try commenting out this check (at your own risk).")

It worked for me

@AlexBlack2202
do not use "python setup.py install --cuda_ext --cpp_ext"
use "python setup.py install" instead

This works for me. Maybe the author should add this command to the readme file.

@joeyslv commented Mar 16, 2022

do not use "python setup.py install --cuda_ext --cpp_ext"
use "python setup.py install" instead

This works for me. Maybe the author should add this command to the readme file.

But it will then report a ModuleNotFoundError when you try to use a cuda extension.

@MubarakHAlketbi
This still exists in 2023; it shouldn't.
We are on CUDA 12.1.

@drzraf commented Aug 30, 2024

setup.py provides no environment variable for this, but torch/utils/cpp_extension.py forwards $CC to nvcc's -ccbin argument (if using gcc-9 you may also have to add -allow-unsupported-compiler around line 582).

Still in torch/utils/cpp_extension.py, there is an if cuda_ver.major != torch_cuda_version.major check around line 416 that could also be removed.

With these applied, it's possible to compile apex using cuda-toolkit 11.5 against a torch 2.4.0 wheel built with Cuda 12.1 (although I'm not sure it will actually load).
