Couldn't allocate MR while testing GDR with CUDA #269

Open
derekwin opened this issue Jun 22, 2024 · 10 comments

@derekwin

System info

Ubuntu 22.04
kernel: 6.5.0-28-generic

NVIDIA driver and CUDA version:

Driver Version: 555.42.02
CUDA Version: 12.5

I installed the RDMA OFED driver before installing the CUDA driver and CUDA toolkit.

peermem module status:

nvidia_peermem         16384  0
nvidia_uvm           4943872  0
nvidia_drm            122880  0
nvidia_modeset       1368064  1 nvidia_drm
nvidia              54566912  3 nvidia_uvm,nvidia_peermem,nvidia_modeset
video                  73728  1 nvidia_modeset
ib_core               557056  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm_kms_helper        274432  4 ast,nvidia_drm
drm                   765952  6 drm_kms_helper,ast,drm_shmem_helper,nvidia,nvidia_drm

Error occurred:
./ib_send_bw --use_cuda=0

Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 1B:00
CUDA device 1: PCIe address is 3E:00
CUDA device 2: PCIe address is 89:00
CUDA device 3: PCIe address is B2:00

Picking device No. 0
[pid = 3164333, dev = 0] device name = [NVIDIA GeForce RTX 4090]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007c43eac00000 pointer=0x7c43eac00000
Couldn't allocate MR
failed to create mr
Failed to create MR
 Couldn't create IB resources
destroying current CUDA Ctx

./ib_send_bw --use_cuda=0 192.168.2.244

Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 1B:00
CUDA device 1: PCIe address is 3E:00
CUDA device 2: PCIe address is 89:00
CUDA device 3: PCIe address is B2:00

Picking device No. 0
[pid = 3164350, dev = 0] device name = [NVIDIA GeForce RTX 4090]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007847fac00000 pointer=0x7847fac00000
Couldn't allocate MR
failed to create mr
Failed to create MR
 Couldn't create IB resources
destroying current CUDA Ctx
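
For context, the step that fails here is registering the cuMemAlloc'd GPU buffer with the HCA; cuMemAlloc() itself succeeds, and the MR registration that follows is what prints "Couldn't allocate MR". A minimal sketch of that path, assuming the legacy nvidia_peermem route (register_gpu_buffer is a hypothetical helper for illustration, not perftest's actual code):

/* Sketch: register a cuMemAlloc'd buffer with ibv_reg_mr(). With the legacy
 * peer-memory path this only works when nvidia_peermem is loaded and the
 * GPU/driver combination supports GPUDirect RDMA. A CUDA context is assumed
 * to be current (cuInit/cuCtxCreate already done, as in the log above). */
#include <cuda.h>
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

static int register_gpu_buffer(struct ibv_pd *pd, size_t size)
{
    CUdeviceptr dptr;
    if (cuMemAlloc(&dptr, size) != CUDA_SUCCESS)       /* succeeds in the log */
        return -1;

    struct ibv_mr *mr = ibv_reg_mr(pd, (void *)(uintptr_t)dptr, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {                                         /* "Couldn't allocate MR" */
        perror("ibv_reg_mr");
        cuMemFree(dptr);
        return -1;
    }

    ibv_dereg_mr(mr);
    cuMemFree(dptr);
    return 0;
}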
@derekwin

derekwin commented Jun 23, 2024

Sorry, I didn't notice this suggestion earlier:
8. If GPUDirect is not working, (e.g. you see "Couldn't allocate MR" error message), consider disabling Scatter to CQE feature. Set the environmental variable MLX5_SCATTER_TO_CQE=0. E.g.:
MLX5_SCATTER_TO_CQE=0 ./ib_write_bw -d ib_dev --use_cuda= -a

@derekwin

After setting MLX5_SCATTER_TO_CQE=0, the problem still exists.

derekwin reopened this Jun 23, 2024
@Jye-525

Jye-525 commented Jul 10, 2024

@derekwin Did you solve this problem? I am encountering the same error now.

@derekwin

@derekwin Did you solve this problem? I am encountering the same error now.

I still have this error. :(

@Jye-525

Jye-525 commented Jul 10, 2024

@derekwin I just solved it on my side. I was testing across 2 nodes, but I had only loaded nvidia-peermem on one of them. Loading nvidia-peermem on all the nodes solved it. Here is the command:
sudo modprobe nvidia-peermem

You can use lsmod | grep nvidia_peermem to check whether it is loaded. Hope it works for you too.

@YuMJie

YuMJie commented Jul 18, 2024

If only one of the two machines supports GDR and the other has a consumer-grade graphics card, how should GDR bandwidth be tested?

@alokprasad

alokprasad commented Oct 28, 2024

@derekwin Which GPU are you using for this? Which MOFED version?
Did you also try dma_buf?

@derekwin

@derekwin I just solved it on my side. I was testing across 2 nodes, but I had only loaded nvidia-peermem on one of them. Loading nvidia-peermem on all the nodes solved it. Here is the command: sudo modprobe nvidia-peermem

You can use lsmod | grep nvidia_peermem to check whether it is loaded. Hope it works for you too.

I am certain that the module has been correctly loaded.

nvidia_peermem         16384  0
nvidia               9736192  309 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_uverbs             196608  9 nvidia_peermem,rdma_ucm,mlx5_ib

@derekwin

@derekwin Which GPU are you using for this? Which MOFED version? Did you also try dma_buf?

I am using an NVIDIA GeForce RTX 4090. The MOFED version is MLNX_OFED_LINUX-24.07-0.6.1.0. I have not tried dma_buf.
I reinstalled the entire system, upgrading from Ubuntu 22.04 to 24.04, and then reinstalled the latest versions of MOFED and CUDA (12.6). However, the issue persists.

@derekwin
Copy link
Author

@derekwin Which GPU are you using for this? Which MOFED version? Did you also try dma_buf?

Does perftest automatically build with dma_buf support?
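
For reference, the dma-buf path registers the GPU buffer without going through a peer-memory kernel module. A rough sketch of what that registration looks like, assuming CUDA 11.7+ with a GPU/driver combination that can export CUDA allocations as dma-buf FDs, and an rdma-core that provides ibv_reg_dmabuf_mr() (register_gpu_buffer_dmabuf is a hypothetical helper for illustration; whether a given perftest build actually uses this path depends on how it was configured):

/* Sketch of dma-buf based registration: export the CUDA allocation as a
 * dma-buf file descriptor and register that with the HCA. No nvidia_peermem
 * module is required on this path. */
#include <cuda.h>
#include <infiniband/verbs.h>
#include <stdint.h>

static struct ibv_mr *register_gpu_buffer_dmabuf(struct ibv_pd *pd,
                                                 CUdeviceptr dptr, size_t size)
{
    int dmabuf_fd = -1;

    /* Export the GPU allocation as a dma-buf FD (CUDA >= 11.7). */
    if (cuMemGetHandleForAddressRange(&dmabuf_fd, dptr, size,
                                      CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD,
                                      0) != CUDA_SUCCESS)
        return NULL;

    /* Register the dma-buf with the RDMA device. */
    return ibv_reg_dmabuf_mr(pd, 0 /* offset */, size,
                             (uint64_t)dptr /* iova */, dmabuf_fd,
                             IBV_ACCESS_LOCAL_WRITE |
                             IBV_ACCESS_REMOTE_READ |
                             IBV_ACCESS_REMOTE_WRITE);
}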
