
apex not supporting CUDA 11.0? [Help me] #988

Open
MuruganR96 opened this issue Nov 7, 2020 · 24 comments

Comments


MuruganR96 commented Nov 7, 2020

My nvcc version is CUDA 11.0, but the latest PyTorch build I found on the website targets CUDA 10.2. As a result I can't install apex properly.

ImportError: cannot import name 'amp'

Pre-installed software versions:

Nvidia Driver: 450.51
CUDA: 11.0
cuDNN: 8.0
Python: 3.8
Docker: 19.03.12
Nvidia-docker: 2.0
NGC (Nvidia GPU Cloud) CLI: 1.15.0

I followed these commands:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Importing apex normally works:

python -c "import apex"

but in the main program it fails:

Traceback (most recent call last):
  File "train.py", line 188, in <module>
    train(num_gpus, args.rank, args.group_name, **train_config)
  File "train.py", line 83, in train
    from apex import amp
ImportError: cannot import name 'amp'

The apex amp module is not being imported.
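
For reference, a quick way to check whether the compiled extensions are actually importable (amp_C is the CUDA extension module that the --cuda_ext build compiles from csrc/amp_C_frontend.cpp; this is just a diagnostic sketch):

python -c "import apex; from apex import amp; import amp_C; print('apex amp and CUDA extensions import fine')"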

Please help me solve this issue @definitelynotmcarilli @thorjohnsen @mcarilli @kexinyu @ptrblck :)


ptrblck commented Nov 7, 2020

but the latest PyTorch build I found on the website targets CUDA 10.2

The latest PyTorch binaries can be installed with CUDA 11.0 as shown in the install instructions.

Note that mixed-precision training is available in PyTorch directly via torch.cuda.amp as explained here, and we recommend using the native implementation.
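
A minimal sketch of the native pattern (the tiny model, optimizer, and random data below are placeholders, not taken from any real training script):

import torch
from torch.cuda.amp import autocast, GradScaler

device = torch.device("cuda")
model = torch.nn.Linear(128, 10).to(device)               # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = GradScaler()

for _ in range(10):                                        # stand-in training loop
    inputs = torch.randn(32, 128, device=device)
    targets = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    with autocast():                                       # forward pass runs in mixed precision
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()                          # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                                 # unscales gradients, then calls optimizer.step()
    scaler.update()                                        # adjust the scale factor for the next iteration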

In case you have trouble building apex, you could use a PyTorch NGC container with CUDA 11.1, where PyTorch and apex come pre-installed.
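
For example, pulling and starting one of the monthly containers looks roughly like this (the 20.11 tag is only an assumption for a CUDA 11.1-based release; pick whichever tag you need):

docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:20.11-py3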


sakaia commented Nov 9, 2020

@ptrblck CUDA 11.0 supports MIG. Is this feature available in PyTorch? Any tips?

I get the following error:

/home/sakaia/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:104: UserWarning:
A100-PCIE-40GB MIG 3g.20gb with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
If you want to use the A100-PCIE-40GB MIG 3g.20gb GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "toy_problem.py", line 87, in <module>
    main(args)
  File "toy_problem.py", line 41, in main
    optimizer = FusedAdam(model.parameters())
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6.egg/apex/optimizers/fused_adam.py", line 79, in __init__
    raise RuntimeError('apex.optimizers.FusedAdam requires cuda extensions')
RuntimeError: apex.optimizers.FusedAdam requires cuda extensions


ptrblck commented Nov 9, 2020

MIG is not PyTorch-specific and can be enabled on your A100.

The error shows that you are using a PyTorch build that doesn't support the compute capability your A100 needs (sm_80), so either install the PyTorch binaries built with CUDA 11.0 or build PyTorch from source.
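
For reference, the CUDA 11.0 binaries could be installed roughly like this at the time of writing (the exact torch/torchvision versions are an assumption; check the install selector for the current ones):

pip3 install torch==1.7.0+cu110 torchvision==0.8.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html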


sakaia commented Nov 9, 2020

Thanks, I used pip3 to install it. I will switch to another method.


stas00 commented Nov 11, 2020

How can I tell apex to use cuda-11.0? I have both cuda-11.0 and cuda-11.1 installed and it fails to build as it doesn't find cuda-11.0:

    Compiling cuda extensions with
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2020 NVIDIA Corporation
    Built on Mon_Oct_12_20:09:46_PDT_2020
    Cuda compilation tools, release 11.1, V11.1.105
    Build cuda_11.1.TC455_06.29190527_0
    from /usr/local/cuda-11.1/bin

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-yz2qpdod/setup.py", line 152, in <module>
        check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
      File "/tmp/pip-req-build-yz2qpdod/setup.py", line 102, in check_cuda_torch_binary_vs_bare_metal
        raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
    RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 11.0.

Also, would it be possible to provide apex builds on conda-forge for cuda-11.0 and cuda-11.1?

Thank you!


ptrblck commented Nov 12, 2020

@stas00 you can try setting CUDA_HOME=/usr/local/cuda-11.0 to specify the desired CUDA version.


stas00 commented Nov 12, 2020

Awesome!

CUDA_HOME=/usr/local/cuda-11.0 pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

but no luck building it:

    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    In file included from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/ATen/Parallel.h:149,
                     from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
                     from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
                     from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
                     from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/all.h:12,
                     from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/extension.h:4,
                     from /tmp/pip-req-build-ngd468_f/csrc/amp_C_frontend.cpp:1:
    /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:84: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
       84 | #pragma omp parallel for if ((end - begin) >= grain_size)
          |
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1533, in _run_ninja_build
        subprocess.run(
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/subprocess.py", line 512, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-ngd468_f/setup.py", line 405, in <module>
        setup(
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/command/install.py", line 61, in run
        return orig.install.run(self)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/install.py", line 545, in run
        self.run_command('build')
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 674, in build_extensions
        build_ext.build_extensions(self)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
        _build_ext.build_ext.build_extensions(self)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
        self._build_extensions_serial()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
        self.build_extension(ext)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
        _build_ext.build_extension(self, ext)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
        objects = self.compiler.compile(sources,
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 494, in unix_wrap_ninja_compile
        _write_ninja_file_and_compile_objects(
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1252, in _write_ninja_file_and_compile_objects
        _run_ninja_build(
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension
    Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /home/stas/anaconda3/envs/main-38/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-2i80o0cf/install-record.txt --single-version-externally-managed --compile --install-headers /home/stas/anaconda3/envs/main-38/include/python3.8/apex Check the logs for full command output.                                                                                           
Exception information:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/req/req_install.py", line 838, in install
    success = install_legacy(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/operations/install/legacy.py", line 86, in install
    raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 228, in _main
    status = self.run(options, args)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 182, in wrapper
    return func(self, options, args)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/commands/install.py", line 397, in run
    installed = install_given_reqs(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/req/__init__.py", line 82, in install_given_reqs
    requirement.install(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/req/req_install.py", line 856, in install
    six.reraise(*exc.parent)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/six.py", line 703, in reraise
    raise value
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/operations/install/legacy.py", line 74, in install
    runner(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/utils/subprocess.py", line 273, in runner
    call_subprocess(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
    raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /home/stas/anaconda3/envs/main-38/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-2i80o0cf/install-record.txt --single-version-externally-managed --compile --install-headers /home/stas/anaconda3/envs/main-38/include/python3.8/apex Check the logs for full command output.


ptrblck commented Nov 12, 2020

I don't see the actual error message here, only that the build is failing, so I can't tell whether the right CUDA version was picked up now.
If you've linked the versioned CUDA toolkits to /usr/local/cuda, could you recreate the symbolic links to point at the desired CUDA version?
Alternatively, since only the minor version differs, you could also try to disable the minor version check (you should get an error with a link to more information in your first run), and rebuild it.
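
For example, something along these lines, assuming the toolkits live in the default /usr/local locations:

sudo rm /usr/local/cuda
sudo ln -s /usr/local/cuda-11.0 /usr/local/cuda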


stas00 commented Nov 12, 2020

Alternatively, since only the minor version differs, you could also try to disable the minor version check (you should get an error with a link to more information in your first run), and rebuild it.

There is no option to do that, so I had to hack setup.py to disable the check:

diff --git a/setup.py b/setup.py
index 063b42d..9eabb49 100644
--- a/setup.py
+++ b/setup.py
@@ -91,6 +91,7 @@ def get_cuda_bare_metal_version(cuda_dir):
     return raw_output, bare_metal_major, bare_metal_minor

 def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
+    return
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
[...]
Successfully installed apex-0.1

So I successfully built apex against the system-wide cuda-11.1 while having pytorch built with cuda-11.0 installed.

Yay!

And it works just fine!

Thank you, @ptrblck!

@Welllee12366

@ptrblck When I install the apex toolkit, I run into the problems below:

    nvcc fatal   : Unsupported gpu architecture 'compute_86'
    error: command '/usr/local/cuda-11.0/bin/nvcc' failed with exit status 1
    Running setup.py install for apex ... error

I have searched for this problem but found no answer.
How can I resolve this issue properly?
My GPU hardware information is below:
NVIDIA RTX 3090
And the CUDA compiler info is below:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

Thank you!


stas00 commented Nov 15, 2020

I'm pretty sure you need cuda-11.1 for that - I built apex with it despite pytorch using cudatoolkit-11.0.

Once you have cuda-11.1 installed, follow the notes in #988 (comment)
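
As a sanity check, you can ask nvcc which architectures it supports; compute_86 should show up once you are on 11.1 or newer (assuming your nvcc version has this flag):

nvcc --list-gpu-arch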

@Welllee12366

I'm pretty sure you need cuda-11.1 for that - I built apex with it despite pt using cudatoolkit-11.0

Once you have cuda-11.1 installed, follow the notes in #988 (comment)

Awesome!
Thanks for your reply, I have installed the toolkit successfully.


stas00 commented Nov 15, 2020

I added a proper solution here: #997


Qi-Chuan commented Jan 6, 2021

@ptrblck When I install the apex toolkit, I run into the problems below:

    nvcc fatal   : Unsupported gpu architecture 'compute_86'
    error: command '/usr/local/cuda-11.0/bin/nvcc' failed with exit status 1
    Running setup.py install for apex ... error

[...]

Hello, I ran into the same problem. Can you tell me how you solved it? Thanks a lot!

@zhenhao-huang

@stas00 Hi, I have the same problem:
nvcc fatal : Unsupported gpu architecture 'compute_86'
I don't quite understand how you solved this. Can you show me more details?


stas00 commented Jan 16, 2021

@zhenhao-huang:

  1. Install cuda-11.1 system-wide.
  2. Use the branch from #997 (add an option to skip minor ver check).
  3. Add --global-option="--skip-minor-ver-check" to the pip install command for apex on that branch (full command sketched below).
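
Putting it together, the full install command should look roughly like this (assuming the branch from #997 is checked out inside the apex clone):

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--skip-minor-ver-check" ./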

@zhenhao-huang

@stas00 Successfully installed apex-0.1. Thank you!

@v-nhandt21

Alternatively, since only the minor version differs, you could also try to disable the minor version check (you should get an error with a link to more information in your first run), and rebuild it.

There is no option to do that, so I had to hack setup.py to disable the check: [...]

So I successfully built apex against the system-wide cuda-11.1 while having pytorch built with cuda-11.0 installed.

I used this trick and installed apex successfully, but then I ran into this error:

[screenshot of the error; image not included]


empty-id commented Jun 26, 2021

Hi @stas00, I used your branch but still get the error "nvcc fatal: unsupported gpu architecture 'compute_86'" :(


stas00 commented Jun 26, 2021

@empty-id, make sure you have cuda-11.1 or higher installed and configured correctly - please see: https://huggingface.co/transformers/master/main_classes/trainer.html#possible-problem-2
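
Concretely, "configured correctly" usually means pointing PATH and LD_LIBRARY_PATH at the 11.1 toolkit before building, e.g. (assuming the default install location):

export PATH=/usr/local/cuda-11.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH
nvcc --version   # should now report release 11.1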


empty-id commented Jun 27, 2021

Now it is installed with cuda-11.1, but I get the following problem when I run PyTorch code with apex... @stas00

RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}


#define __HALF_TO_US(var) *(reinterpret_cast<unsigned short *>(&(var)))
#define __HALF_TO_CUS(var) *(reinterpret_cast<const unsigned short *>(&(var)))
#if defined(__cplusplus)
  struct __align__(2) __half {
    __host__ __device__ __half() { }

  protected:
    unsigned short __x;
  };

  /* All intrinsic functions are only available to nvcc compilers */
  #if defined(__CUDACC__)
    /* Definitions of intrinsics */
    __device__ __half __float2half(const float f) {
      __half val;
      asm("{  cvt.rn.f16.f32 %0, %1;}\n" : "=h"(__HALF_TO_US(val)) : "f"(f));
      return val;
    }

    __device__ float __half2float(const __half h) {
      float val;
      asm("{  cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(h)));
      return val;
    }

  #endif /* defined(__CUDACC__) */
#endif /* defined(__cplusplus) */
#undef __HALF_TO_US
#undef __HALF_TO_CUS

typedef __half half;

extern "C" __global__
void func_1(half* t0, half* aten_mul_flat) {
{
  float t0_ = __half2float(t0[10240 * (((512 * blockIdx.x + threadIdx.x) / 10240) % 5) + (512 * blockIdx.x + threadIdx.x) % 10240]);
  aten_mul_flat[512 * blockIdx.x + threadIdx.x] = __float2half((t0_ * 0.5f) * ((tanhf((t0_ * 0.7978845834732056f) * ((t0_ * 0.04471499845385551f) * t0_ + 1.f))) + 1.f));
}
}


stas00 commented Jun 27, 2021

Looks like the same error as reported here pytorch/pytorch#47669 (comment), which apparently was fixed in pytorch many months back. Try pytorch-1.9.0, and if it doesn't work please file a new issue.

In general, use Google to search for similar errors; that is how I found the URL above.


empty-id commented Jun 27, 2021

@stas00 Thank you for your reply! I finally got it working. I found that your hack is not necessary: with torch-1.9.0 built for cuda-11.1 and cuda-11.1 installed system-wide, installing the latest NVIDIA/apex GitHub repo just works.
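
For reference, the matching CUDA 11.1 wheels could be installed roughly like this (the torchvision version here is an assumption):

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html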

@xiao-ming-code

Can you explain it in more detail? Do I create a diff file, copy the code above into it, and then apply it?
