RuntimeError: CUDA error: invalid device function #982

tigerccx · 2020-10-20T08:24:49Z

I am trying to run this GitHub project and encountered a CUDA error with apex.

`Traceback (most recent call last):
File "train_AEI.py", line 132, in <module>
scaled_loss.backward()
File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/contextlib.py", line 119, in __exit__
next(self.gen)
File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 128, in post_backward_models_are_masters
scale_override=grads_have_scale/out_scale)
File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/apex/amp/scaler.py", line 117, in unscale
1./scale)
File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
*args)
RuntimeError: CUDA error: invalid device function (multi_tensor_apply at csrc/multi_tensor_apply.cuh:111)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f7679444193 in /home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: void multi_tensor_apply<2, ScaleFunctor<float, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<float, float>, float) + 0x1270 (0x7f7668c39ce0 in /home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x829 (0x7f7668c37c99 in /home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x25e5a (0x7f7668c27e5a in /home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1f641 (0x7f7668c21641 in /home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)

frame #35: __libc_start_main + 0xf0 (0x7f767da69840 in /lib/x86_64-linux-gnu/libc.so.6)

Segmentation fault (core dumped)`

What could be the problem?

ptrblck · 2020-10-28T01:12:24Z

You might be using an older apex version, which didn't have the device guards for multi_tensor_apply.
Note that we recommend using the native mixed-precision implementation, as explained here.
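For readers unfamiliar with the native implementation, the following is a minimal sketch of one training step using `torch.cuda.amp.autocast` and `torch.cuda.amp.GradScaler`, which ship with PyTorch 1.6 and later. The linear model, random data, and `train_step` helper are hypothetical placeholders and are not taken from the project in question. Newer PyTorch releases may emit a deprecation warning for these entry points.

```python
import torch
from torch import nn


def train_step(model, optimizer, scaler, inputs, targets, use_amp):
    """Run one optimization step with native mixed precision (sketch)."""
    optimizer.zero_grad()
    # Forward pass under autocast: eligible ops run in float16 on CUDA.
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss before backward to limit float16 gradient underflow.
    scaler.scale(loss).backward()
    # step() unscales the gradients and skips the update on inf/NaN values.
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


# Hypothetical toy setup; falls back to plain float32 when no GPU is present.
use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")
model = nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(16, 8, device=device)
targets = torch.randn(16, 1, device=device)
loss_value = train_step(model, optimizer, scaler, inputs, targets, use_amp)
```

This plays the same role as the `amp.scale_loss(...)` context manager seen in the traceback, but it needs no separately compiled extension such as `amp_C`.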

tigerccx · 2020-10-29T01:56:02Z

@ptrblck Thank you for your answer. I followed the installation guide, so where can I get a newer version of apex? Or could it be that, because I am using CUDA 10.0, apex was automatically compiled as an older version?

ptrblck · 2020-10-29T06:09:37Z

You should get an error if you try to compile apex with a CUDA version different from the one used to build PyTorch.
However, native mixed-precision training works out of the box in PyTorch without building a third-party package.

tigerccx · 2020-10-30T13:01:27Z

@ptrblck My server has CUDA 10.0 installed (as reported by `nvcc -V`) and PyTorch 1.4.0+cu100, so the versions should match. Updating CUDA is not really an option for me, so I cannot use a newer version of PyTorch that includes the native `cuda.amp` integration.
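As a quick sanity check of the version match described above, here is a small sketch that compares the toolkit release reported by `nvcc -V` with the `+cuXYZ` tag in the PyTorch version string. The helper names and the sample `nvcc` line are illustrative assumptions, not output captured from the poster's machine, and a match does not rule out other causes of the error.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc -V` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_torch_cuda_tag(torch_version):
    """Extract (major, minor) from a version string such as '1.4.0+cu100'."""
    # The last digit of the tag is the minor version, the rest is the major.
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_versions_match(nvcc_output, torch_version):
    """Return True when both strings name the same CUDA major.minor release."""
    toolkit = parse_nvcc_release(nvcc_output)
    built_with = parse_torch_cuda_tag(torch_version)
    return toolkit is not None and toolkit == built_with


# Illustrative sample line, assumed to resemble the tail of `nvcc -V` output.
sample_nvcc = "Cuda compilation tools, release 10.0, V10.0.130"
print(cuda_versions_match(sample_nvcc, "1.4.0+cu100"))  # prints True
```

In a live environment, `torch.version.cuda` reports the CUDA version PyTorch was built with and can be compared against the toolkit release directly.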
