I’m experiencing errors when attempting to run nccl-tests with SHARP enabled in multi-GPU per node configurations. I’m hoping to get some insights into the cause of these errors.
Environment
OS: Ubuntu 20.04.6 LTS on 4 servers, each with 2 × V100 GPUs and a ConnectX-6 HCA
I ran nccl-tests in four different cases, and I found that in Case 2, where multiple GPUs are used per node, SHARP initialization fails, and Streaming Aggregation is disabled.
Case 1: 2 GPUs (2 nodes × 1 process/node × 1 GPU/process) ... SHARP Available
Case 2: 4 GPUs (2 nodes × 1 process/node × 2 GPUs/process) ... SHARP Disabled
Case 3: 4 GPUs (4 nodes × 1 process/node × 1 GPU/process) ... SHARP Available
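To make the pattern explicit, here is a small sketch of the three layouts (numbers copied from the case descriptions in this post). Only Case 2 places more than one GPU on a node:

```python
# (nodes, processes per node, GPUs per process) for each case in this post
cases = {
    "Case 1": (2, 1, 1),  # SHARP works
    "Case 2": (2, 1, 2),  # SHARP init fails with "No resource"
    "Case 3": (4, 1, 1),  # SHARP works
}

for name, (nodes, ppn, gpp) in cases.items():
    gpus_per_node = ppn * gpp
    print(f"{name}: {nodes * gpus_per_node} GPUs total, {gpus_per_node} per node")
```

So the failing case is exactly the one where a node contributes two GPUs to the communicator.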
The error output from Case 2 is below.
[snail01][Sep 24 16:15:29 847428][GENERAL][3745742][warn ] - Begin job id: 9370583446459220315 failed with status: No resource
[snail01:0:3745196 unique id 9370583446459220315][2024-09-24 16:15:29] ERROR Job error in sharp_get_job_data_len.
[snail01:0:3745196 - context.c:709][2024-09-24 16:15:29] ERROR sharp_get_job_data_len failed: Job error(-35)
[snail01:0:3745196 - context.c:718][2024-09-24 16:15:29] ERROR SHArP Job init error: No resource
Question: Why does SHARP get disabled with the "No resource" SHArP Job init error in Case 2, where multiple GPUs are used per node? I believe that with multiple GPUs per node, two SHARP jobs are created, and the failure occurred during the initialization of the second job, which might be the cause of the issue. Any insights would be appreciated!
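To confirm how many SHARP jobs are actually being created per run, verbose logging might help. A hedged sketch — `NCCL_DEBUG`/`NCCL_DEBUG_SUBSYS` and `SHARP_COLL_LOG_LEVEL` are documented knobs, but the exact messages they emit depend on the installed NCCL/SHARP versions:

```shell
# Turn up NCCL and sharp_coll logging before launching all_reduce_perf.
# (Assumption: the HPC-X sharp_coll plugin is in use and reads SHARP_COLL_LOG_LEVEL.)
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # log CollNet/SHARP setup decisions
export SHARP_COLL_LOG_LEVEL=4       # verbose sharp_coll output
echo "NCCL_DEBUG=$NCCL_DEBUG SUBSYS=$NCCL_DEBUG_SUBSYS SHARP_LOG=$SHARP_COLL_LOG_LEVEL"
```

These would be forwarded to the ranks with additional `-x` options on the mpirun command line.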
Related: I believe this issue is related, but the information in this thread did not resolve it.
Details
Case 1: 2 GPUs (2 nodes × 1 process/node × 1 GPU/process)
SHARP works without issues in this configuration:
mpirun -n 2 --host snail01:1,snail02:1 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 1 -b 64M -e 128M -f 2
Output (SHARP enabled):
Case 2: 4 GPUs (2 nodes × 1 process/node × 2 GPUs/process)
In this configuration, SHARP gets disabled with the following error:
mpirun -n 2 --host snail01:1,snail02:1 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 2 -b 64M -e 128M -f 2
Error:
The full output is here.
The performance is equivalent to when SHARP is not used.
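One way to narrow this down might be to keep 2 GPUs per node but give each process a single GPU (2 processes per node with `-g 1`). If this layout also fails, the trigger is GPUs per node rather than GPUs per process. This is only a sketch, reusing the hostnames and paths from the commands above:

```shell
mpirun -n 4 --host snail01:2,snail02:2 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 1 -b 64M -e 128M -f 2
```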
Case 3: 4 GPUs (4 nodes × 1 process/node × 1 GPU/process)
This setup works fine, with SHARP enabled:
mpirun -n 4 --host snail01:1,snail02:1,snail03:1,snail04:1 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -t 1 -g 1 -b 64M -e 128M -f 2