Hard error on mismatch between torch.version.cuda and the Cuda toolkit version being used to compile Apex #323
Conversation
@@ -47,10 +47,9 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
     print(raw_output + "from " + cuda_dir + "/bin\n")
 
     if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
Pytorch supports Cuda 9.1. I'm not sure that minor versions must also be compared.
True, and 9.2 as well. Nvidia also ships Docker containers where we build Pytorch from source with cuda 10.1. In practice, I do find that sometimes minor version mismatches can cause errors.
In my case it did not. I had to edit the file and remove this check.
I think it's good to keep this check for safety. I will note what you said in the error message, telling people they can comment it out if necessary, but at their own risk.
Maybe the condition (I mean the if statement) could check something more specific rather than just the Cuda version.
This is outdated, here is an updated diff:
diff --git a/setup.py b/setup.py
index e3063be..ed88abd 100644
--- a/setup.py
+++ b/setup.py
@@ -30,6 +30,11 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
     print(raw_output + "from " + cuda_dir + "/bin\n")
 
     if (bare_metal_version != torch_binary_version):
+
+        # allow minor version mismatch
+        if bare_metal_version.major == torch_binary_version.major and bare_metal_version.minor != torch_binary_version.minor:
+            return
+
         raise RuntimeError(
             "Cuda extensions are being compiled with a version of Cuda that does "
             "not match the version used to compile Pytorch binaries. "
@ptrblck, perhaps this could be added to setup.py in commented-out fashion, so it's trivial for the user to activate? This is a safer approach than an outright return, since it checks that the major versions match.
I have this patch pasted all over, since almost all the projects I'm involved in have this mismatch and it works just fine. Most of the time we have no control over the environment provided by the system (e.g. HPC), so it's not by choice.
Or even better, let's have an env var that activates this exception, so that the assert message could say: run
CUDA_MINOR_VERSION_MISMATCH_OK=1 python setup.py ...
Then the user doesn't have to change the source code at all. I'd be happy to submit a PR if it resonates.
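For concreteness, a minimal sketch of what that opt-out could look like inside setup.py. CUDA_MINOR_VERSION_MISMATCH_OK is the hypothetical env var proposed above, and get_bare_metal_cuda_version is a stand-in helper for this sketch, not apex's actual function:

import os
import subprocess

import torch
from packaging.version import parse


def get_bare_metal_cuda_version(cuda_dir):
    # Stand-in helper: parse "release X.Y" out of `nvcc -V`.
    raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
    return parse(raw_output.split("release ")[1].split(",")[0])


def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
    bare_metal_version = get_bare_metal_cuda_version(cuda_dir)
    torch_binary_version = parse(torch.version.cuda)
    if bare_metal_version != torch_binary_version:
        # Proposed escape hatch: tolerate a minor-version mismatch when the user opts in.
        if (os.getenv("CUDA_MINOR_VERSION_MISMATCH_OK") == "1"
                and bare_metal_version.major == torch_binary_version.major):
            print("Warning: allowing CUDA minor-version mismatch "
                  "({} vs {}) as requested.".format(bare_metal_version, torch_binary_version))
            return
        raise RuntimeError(
            "Cuda extensions are being compiled with a version of Cuda that does "
            "not match the version used to compile Pytorch binaries ({} vs {}). "
            "Run CUDA_MINOR_VERSION_MISMATCH_OK=1 python setup.py ... to allow a "
            "minor-version mismatch at your own risk.".format(bare_metal_version, torch_binary_version)
        )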
I installed CUDA 10.2 and pytorch 1.7.1 in my conda virtual env, the Cuda compilation tools report 11.1, and it didn't work for me.
torch.__version__ = 2.3.0+cu121
/home/E/apex/setup.py:111: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
from /usr/local/cuda/bin
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<string>", line 34, in <module>
  File "/home/pzeng/RAG/SimCSE/apex/setup.py", line 179, in <module>
    check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)
  File "/home/E/apex/setup.py", line 36, in check_cuda_torch_binary_vs_bare_metal
    raise RuntimeError(
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.
In some cases, a minor-version mismatch will not cause later errors: #323 (comment). You can try commenting out this check (at your own risk).
Why do I get a version mismatch error even though the two major versions are the same? In the end I also just commented out the code that the error message points to.
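For reference, and as a partial answer to the question above: the updated check compares the full toolkit versions, so 12.2 vs 12.1 fails even though the majors match. A tiny illustration, assuming the packaging Version objects the diff earlier in the thread appears to use:

from packaging.version import Version

bare_metal_version = Version("12.2")    # nvcc found under /usr/local/cuda (see the log above)
torch_binary_version = Version("12.1")  # toolkit used to build the torch 2.3.0+cu121 wheel

print(bare_metal_version != torch_binary_version)              # True -> the RuntimeError is raised
print(bare_metal_version.major == torch_binary_version.major)  # True, only the minor differs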
I was facing the same issue. I had installed the PyTorch build with cuda 12.1 support and was using cuda 12.0 to build apex. They don't work together even though it's only a minor-version mismatch: after commenting out the RuntimeError check, I ran into a segmentation fault at a later compilation stage (cicc dumps core). I resolved the issue by reinstalling CUDA 12.1 to match my torch version.
Hi @mcarilli, I'm running into this problem too.
Sorry, I just tried deleting the code that detects the version problem, and it seems to work!
@Yorkking This error is just raised as an additional check, since the version mismatch could cause errors. If you don't encounter any issues, you can stick with your current setup.
OK, I got it. Thanks very much! @ptrblck
Hi, I have cuda 8.0 and 9.0 on my machine, but the cuda softlink points to cuda 8.0. How can I use the path to cuda 9.0 to install apex? Thanks.
Could you try to export the cuda folder corresponding to the version you would like to use via e.g.
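(Presumably something along the lines of export CUDA_HOME=/usr/local/cuda-9.0 before running setup.py, so that the build picks up the desired toolkit; CUDA_HOME is the variable the setup script consumes, but the exact path here is an assumption.)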
This was very helpful, thank you! Why would the apex installation script not look for a matching version of cuda and gcc by itself? It was smart enough to send me here (and to require
Mismatched cuda and pytorch versions are a potential issue that's not specific to Apex. The underlying Pytorch extension builder is what's responsible for finding Cuda, and in general it's hard to know where to look on an arbitrary system: the cuda runtime is just a bunch of binaries and shared objects, so it could be present literally anywhere.
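A quick way to see which toolkit the extension builder has found versus the one the wheel was built with (this uses torch.utils.cpp_extension's resolved CUDA_HOME; a diagnostic sketch, not part of apex):

import torch
from torch.utils.cpp_extension import CUDA_HOME  # toolkit path torch's extension builder resolved

print("torch.version.cuda =", torch.version.cuda)  # CUDA used to compile the torch wheel
print("CUDA_HOME          =", CUDA_HOME)           # CUDA toolkit that will compile the extensions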
do not use "python setup.py install --cuda_ext --cpp_ext" |
@zhuhui1214 This will not install the CUDA extensions and thus will run slower. |
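(For reference, the apex README of that era offered a Python-only build via roughly pip install -v --no-cache-dir ./ and the full build via pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./; the exact flags vary by apex revision, so treat these as approximate.)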
I am still confused about how to solve this (I've read the full thread): I'm not able to install Apex with full functionality (the Python-only install works fine), both on my PC and on Google Colab (CUDA 10.1).
Very useful!
I'm hitting this error too. My cuda version is 10.0.130 and my pytorch version is 1.0.0.
My nvcc version is cuda 11.0, but the latest pytorch build I found on the website uses 10.2, so I can't properly install apex. Is it safe to comment out the exception, and where in the code do I comment it out?
Commenting out the check as described in #323 (comment) worked for me too.
Just comment out that if block in setup.py. I have cuda 10.1 and torch 1.5.1.
@griff4692 Have you resolved the issue? I'm facing the same problem. Pre-installed software versions:
Please help me solve this issue, @ptrblck sir. :)
To proceed with the installation you can edit the symbolic link under
to
and the installation proceeded. I guess you can achieve the same by properly setting some environment variable.
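(Presumably this means re-pointing the /usr/local/cuda symlink at the toolkit directory that matches torch.version.cuda, e.g. ln -sfn /usr/local/cuda-10.1 /usr/local/cuda; setting CUDA_HOME to that directory should achieve the same thing without touching the system-wide link. The concrete paths here are assumptions.)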
Is there a way to disable this check without modifying the code? My current installation command is
In my case, I commented out the check at that line.
It worked for me.
This works for me. Maybe the author should add this command to the README file.
But it will report a ModuleNotFoundError when using the cuda extensions.
This still exists in 2023; it shouldn't.
Still in
With these applied, it's possible to compile apex using cuda-toolkit 11.5 against a torch 2.4.0 wheel built with Cuda 12.1.
The warning message was too subtle/too easy to overlook in the output of setup.py, and it really should be a hard error.
Making it a hard error should also assist with cuda driver version errors like #314. #314 resulted because the cuda driver (libcuda.so) version was 10.0, the cuda toolkit version used to compile the Pytorch binaries was 10.0 (which was fine), but the cuda toolkit version used to compile Apex was 10.1** (which triggered a PTX JIT compilation error at runtime because the 10.0 libcuda.so couldn't handle the PTX produced by the 10.1 nvcc). The PTX JIT compilation error message was cryptic and unhelpful.
However, if the toolkit version that was used to compile Pytorch binaries is too recent for the system's cuda driver version, Pytorch will raise a much more helpful error, something like
If we hard-enforce that the cuda toolkit version used to compile Apex == the cuda toolkit version used to compile Pytorch, we also ensure that if the toolkit version used to compile Apex is too new for the driver, the toolkit version used to compile Pytorch must also be too new for the driver, and therefore in such cases we will receive the helpful Pytorch error instead of the bizarre PTX JIT error.
**A warning of the mismatch between torch.version.cuda and the toolkit (nvcc) had likely been issued by the setup.py while compiling apex, but this warning had likely been overlooked, so what ended up surfacing was the PTX JIT error, which was not at all a clear indication of what had gone wrong.
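For concreteness, the hard check discussed here amounts to roughly the following; the condition and message are taken from the hunks quoted above, but this is a sketch that assumes apex's surrounding helpers (get_cuda_bare_metal_version, the torch import), not the verbatim patch:

def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
    raw_output, bare_metal_major, bare_metal_minor = get_cuda_bare_metal_version(cuda_dir)
    torch_binary_major = torch.version.cuda.split(".")[0]
    torch_binary_minor = torch.version.cuda.split(".")[1]

    print("\nCompiling cuda extensions with")
    print(raw_output + "from " + cuda_dir + "/bin\n")

    # Previously only a warning was printed here; raising makes the mismatch impossible
    # to miss and lets the user hit Pytorch's clearer "driver too old" error instead of
    # a cryptic PTX JIT failure at runtime.
    if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
        raise RuntimeError(
            "Cuda extensions are being compiled with a version of Cuda that does "
            "not match the version used to compile Pytorch binaries. "
            "Pytorch binaries were compiled with Cuda {}.".format(torch.version.cuda)
        )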