RAM issue cuda 10.2 vs cuda 10.1 #895
Comments
Hello, the first thing is to study whether this might be due to cudnn. You need to try two things to find out about this:
Also, FYI, bear in mind that what is reported is memory allocated internally by CUDA/cudnn, not memory that is no longer available. Typically cudnn does not fully deallocate its handles, but the memory somehow remains available.
Using
I observe:
I'll try with
Thanks. Look at the cudnn versions that come with each cuda flavor (i.e. from the original nvidia docker images).
I've built new images without cudnn.
It seems that the gap is smaller now... As for the cudnn version, it is the same for both images (cuda10_1 and cuda10_2): 7.6.5.
Looks like a CUDNN/CUDA internal thing to me. I'd suggest you try the tensorrt backend.
Hum, I am not sure it will change anything, as tensorrt uses cuda. Tensorrt7 requires cuda 10.2, and I may have seen an increase in memory usage there (to be verified), but I'll double check to be sure. Tensorrt does not support all the architectures yet (and I observed a significant difference in predictions between caffe and tensorrt for a refinedet model; an issue will be raised soon), and I have some dependencies that require caffe before upgrading to tensorrt. Note also the issue related to the `{"gpu": true}` flag.
You can use the following script to create two images with DD and tensorrt as the backend, one with cuda 10.1 and the other with cuda 10.2.
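The script itself isn't reproduced here; a minimal sketch of what it could look like, assuming the Dockerfile accepts the CUDA base image and the backend as build arguments (the Dockerfile path and the `BASE_IMAGE` / `DEEPDETECT_BUILD` argument names are placeholders, not necessarily the real ones):

```bash
#!/bin/bash
# Hypothetical sketch: build two DD+tensorrt images that differ only in the CUDA base.
# Dockerfile path and build-arg names are assumptions about the build setup.
set -e

docker build -t dd_trt_cuda10.1 \
  --build-arg BASE_IMAGE=nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 \
  --build-arg DEEPDETECT_BUILD=tensorrt \
  -f docker/gpu.Dockerfile .

docker build -t dd_trt_cuda10.2 \
  --build-arg BASE_IMAGE=nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04 \
  --build-arg DEEPDETECT_BUILD=tensorrt \
  -f docker/gpu.Dockerfile .
```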
We observe a 10% increase in memory usage for cuda 10.2, and as explained before I still need to use caffe for some specific models not yet supported by tensorrt. Did you observe the same thing on your side?
It might also come from caffe not correctly supporting cuda 10.2; I am just throwing out some ideas...
You can look for yourself, it's basic cudnn calls: https://github.com/jolibrain/caffe/blob/master/src/caffe/layers/cudnn_conv_layer.cu, and we recently updated it for CUDNN8, see jolibrain/caffe#75. If you doubt the implementation, you'll see ours (e.g. with cudnn8) is similar to that of OpenCV: opencv/opencv#17685 and opencv/opencv#17496. If you'd like to dig further, you can find NVidia's useless answer to our findings that cudnn doesn't free the handles: https://forums.developer.nvidia.com/t/cudnn-create-handle-t-usage-and-memory-reuse/111257. The memory issue is even worse with CUDA 11.1, see jolibrain/caffe#78. You may want to try your docker with CUDA 11.1 + CUDNN8, since in any case that is the present/future. Regarding tensorrt, unless you are using LSTM layers, all training layers supported in DD are supported at the moment, AFAIK.
Thanks for all this information. It seems weird, as I did not observe this kind of memory usage with other DL frameworks; I will have a deeper look. I never tried training with your tensorrt backend, as it seems fairly new and we experienced some bugs with it on the inference side. Moreover, according to the documentation it only supports the image connector.
We don't have more info than what's in the code and the CUDNN doc. The CUDNN memory issues are everywhere if you look for them. Your tests clearly indicate that it's a CUDNN issue with something in the underlying CUDA. CUDNN preallocates and then dynamically allocates depending on the underlying algorithms (FFT, Winograd, ...); you can read about that in the CUDNN doc and elsewhere.
It seems that cudnn 8.0.0 has fixed some of these problems.
Configuration
The following issue has been observed using a GTX 1080, a GTX 1080 Ti and a Titan X.
DD commit: 4ce9277
Your question / the problem you're facing:
Since using a DD version built with cuda 10.2, I have noticed a large increase in memory usage.
Compiling the same DD version with cuda 10.2 and with cuda 10.1 shows a factor-of-3 increase in RAM usage for a googlenet.
Error message (if any) / steps to reproduce the problem:
The following script builds DD with cuda 10.1 and cuda 10.2 using the latest commit.
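The script isn't included here; a rough sketch of the idea, with the same caveats as the build sketch above (Dockerfile path and `BASE_IMAGE` build argument are assumptions):

```bash
#!/bin/bash
# Hypothetical sketch: build the same DD commit against two CUDA bases.
set -e
git clone https://github.com/jolibrain/deepdetect.git && cd deepdetect
git checkout 4ce9277

docker build -t dd_cuda10.1 \
  --build-arg BASE_IMAGE=nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 \
  -f docker/gpu.Dockerfile .

docker build -t dd_cuda10.2 \
  --build-arg BASE_IMAGE=nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04 \
  -f docker/gpu.Dockerfile .
```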
Once the two images are available, we can run them (not at the same time) on an instance with no other GPU usage.
Here is what we can observe with cuda 10.2:
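The exact output isn't reproduced here; the kind of check used is roughly the following (container name, port mapping and image tag are placeholders):

```bash
# Run the cuda 10.2 image built above and watch GPU memory from the host.
docker run -d --name dd_102 --gpus all -p 8080:8080 dd_cuda10.2
nvidia-smi --query-gpu=memory.used --format=csv
```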
We then create a service with googlenet:
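The exact call isn't shown here; a typical DeepDetect service creation for an image classification model looks roughly like this (service name, model repository path and nclasses are placeholders):

```bash
# Create a caffe-backed image classification service on the running DD server.
curl -s -X PUT "http://localhost:8080/services/imageserv" -d '{
  "mllib": "caffe",
  "description": "googlenet classifier",
  "type": "supervised",
  "parameters": {
    "input": {"connector": "image"},
    "mllib": {"nclasses": 1000}
  },
  "model": {"repository": "/opt/models/ggnet"}
}'
```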
And we observe 2804 MiB used for this model.
Now if we launch a prediction
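The request isn't reproduced here; a sketch of such a predict call, with the `{"gpu": true}` flag passed under the mllib parameters (placement assumed from the DD API; image URL and dimensions are placeholders):

```bash
# Run a prediction against the service created above, asking for GPU execution.
curl -s -X POST "http://localhost:8080/predict" -d '{
  "service": "imageserv",
  "parameters": {
    "input": {"width": 224, "height": 224},
    "mllib": {"gpu": true}
  },
  "data": ["https://example.com/ambulance.jpg"]
}'
```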
We observe an increase of 1 GiB.
Now let's do the same exercise using the cuda 10.1 build.
We then create a service with googlenet:
And we observe 932 MiB used for this model, 3 times less RAM than when we used cuda 10.2.
Now if we launch a prediction
We observe an increase of 1 GiB (so, in the end, about 2 times less RAM overall).
We observe here that using cuda 10.2 instead of cuda 10.1 significantly raises the RAM footprint of a simple model such as googlenet.
Moreover, the extra 1 GiB allocated when we make a prediction is also pretty weird. Playing with the requests made me realize that it is due to the {"gpu": true} flag in the POST request. If I do not use this flag, the increase does not happen.
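For instance, issuing the same predict call without the flag (same placeholder names as above):

```bash
# Same prediction, but without the gpu flag under parameters.mllib.
curl -s -X POST "http://localhost:8080/predict" -d '{
  "service": "imageserv",
  "parameters": {
    "input": {"width": 224, "height": 224}
  },
  "data": ["https://example.com/ambulance.jpg"]
}'
```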
We do not observe any increase.
Note that I observed the same thing with cuda 10.2.
In the end, I observed two issues: