[SHARP] Error When Running nccl-tests with multi-GPUs per node using SHARP #1460

nariaki3551 commented Sep 24, 2024

Hi NCCL team,

I’m experiencing errors when attempting to run nccl-tests with SHARP enabled in multi-GPU per node configurations. I’m hoping to get some insights into the cause of these errors.

Environment

  • OS/Hardware: Ubuntu 20.04.6 LTS on 4 servers, each with 2 × V100 GPUs and a ConnectX-6 HCA
  • HPC-X: hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64
  • SHARP
    • sharp_am: v3.8.0
    • plugin: hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/nccl_rdma_sharp_plugin
  • nccl-tests: commit 9d26b8422ba76c098df996b96e13b8ddf3a71165

Summary and Question

I ran nccl-tests in three different cases, and I found that in Case 2, where multiple GPUs are used per node, SHARP initialization fails and Streaming Aggregation is disabled (a more verbose debug invocation is sketched just after the list).

  • Case 1: 2 GPUs (2 nodes × 1 process/node × 1 GPU/process) ... SHARP Available
  • Case 2: 4 GPUs (2 nodes × 1 process/node × 2 GPUs/process) ... SHARP Error
  • Case 3: 4 GPUs (4 nodes × 1 process/node × 1 GPU/process) ... SHARP Available
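
For reference, the way I distinguish "SHARP Available" from "SHARP Error" above is by the plugin's log lines. A sketch of a more verbose invocation for Case 2 (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL variables; SHARP_COLL_LOG_LEVEL is a sharp_coll tunable, and the value 3 is just an assumed info-level setting):

mpirun -n 2 --host snail01:1,snail02:1 -x NCCL_COLLNET_ENABLE=1 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET -x SHARP_COLL_LOG_LEVEL=3 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 2 -b 64M -e 128M -f 2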

The error output from Case 2 is shown below.

[snail01][Sep 24 16:15:29 847428][GENERAL][3745742][warn ] - Begin job id: 9370583446459220315 failed with status: No resource
[snail01:0:3745196 unique id 9370583446459220315][2024-09-24 16:15:29] ERROR Job error in sharp_get_job_data_len.

[snail01:0:3745196 - context.c:709][2024-09-24 16:15:29] ERROR sharp_get_job_data_len failed: Job error(-35)
[snail01:0:3745196 - context.c:718][2024-09-24 16:15:29] ERROR SHArP Job init error: No resource

Question: Why is SHARP disabled with the "No resource" SHArP Job init error in Case 2, where multiple GPUs are used per node? It appears that with multiple GPUs per node, two SHARP jobs are created, and the failure occurred during initialization of the second job, which might be the cause of the issue. Any insights would be appreciated!
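
One experiment I have not yet verified: if the second job is rejected because the first job already holds the available OST quota, lowering the per-job quota might allow both jobs to start. SHARP_COLL_JOB_QUOTA_OSTS is a documented sharp_coll tunable; the value 8 below is only a guess:

mpirun -n 2 --host snail01:1,snail02:1 -x NCCL_COLLNET_ENABLE=1 -x SHARP_COLL_JOB_QUOTA_OSTS=8 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 2 -b 64M -e 128M -f 2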

Related: I believe this issue is related, but the information in that thread did not resolve my problem.

Details

Case 1: 2 GPUs (2 nodes × 1 process/node × 1 GPU/process)

SHARP works without issues in this configuration:

mpirun -n 2 --host snail01:1,snail02:1 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 1 -b 64M -e 128M -f 2

Output (SHARP enabled):

# nThread 1 nGpus 1 minBytes 67108864 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3769265 on    snail01 device  0 [0x84] Tesla V100-PCIE-16GB
#  Rank  1 Group  0 Pid 2231269 on    snail02 device  0 [0x84] Tesla V100-PCIE-16GB
[snail01:0:3769265 - context.c:670][2024-09-24 16:17:01] INFO job (ID: 9370583448898977576) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[snail01:0:3769265 - context.c:867][2024-09-24 16:17:01] INFO sharp_job_id:1    resv_key: tree_type:LLT tree_idx:0  treeID:0 caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[snail01:0:3769265 - context.c:882][2024-09-24 16:17:01] INFO sharp_job_id:1    tree_type:SAT tree_idx:1  treeID:64 caps:0x16
[snail01:0:3769265 - comm.c:400][2024-09-24 16:17:01] INFO [group#:0] job_id:1 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[snail01:0:3769265 - comm.c:400][2024-09-24 16:17:01] INFO [group#:1] job_id:1 group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    67108864      16777216     float     sum      -1   6891.2    9.74    9.74      0   6882.9    9.75    9.75      0
   134217728      33554432     float     sum      -1    13698    9.80    9.80      0    13689    9.80    9.80      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 9.77284 

Case 2: 4 GPUs (2 nodes × 1 process/node × 2 GPUs/process)

In this configuration, SHARP initialization fails and Streaming Aggregation is disabled:

mpirun -n 2 --host snail01:1,snail02:1 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 2 -b 64M -e 128M -f 2


The full output, including the error, is below.

# nThread 1 nGpus 2 minBytes 67108864 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3745196 on    snail01 device  0 [0x84] Tesla V100-PCIE-16GB
#  Rank  1 Group  0 Pid 3745196 on    snail01 device  1 [0x89] Tesla V100S-PCIE-32GB
#  Rank  2 Group  0 Pid 2230757 on    snail02 device  0 [0x84] Tesla V100-PCIE-16GB
#  Rank  3 Group  0 Pid 2230757 on    snail02 device  1 [0x89] Tesla V100S-PCIE-32GB
[snail01:0:3745196 - context.c:670][2024-09-24 16:15:29] INFO job (ID: 9370583448952299738) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[snail01:0:3745196 - context.c:867][2024-09-24 16:15:29] INFO sharp_job_id:1    resv_key: tree_type:LLT tree_idx:0  treeID:0 caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[snail01:0:3745196 - context.c:882][2024-09-24 16:15:29] INFO sharp_job_id:1    tree_type:SAT tree_idx:1  treeID:64 caps:0x16
[snail01:0:3745196 - comm.c:400][2024-09-24 16:15:29] INFO [group#:0] job_id:1 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[snail01:0:3745196 - comm.c:400][2024-09-24 16:15:29] INFO [group#:1] job_id:1 group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[snail01:0:3745196 - context.c:670][2024-09-24 16:15:29] INFO job (ID: 9370583446459220315) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[snail01][Sep 24 16:15:29 847428][GENERAL][3745742][warn ] - Begin job id: 9370583446459220315 failed with status: No resource
[snail01:0:3745196 unique id 9370583446459220315][2024-09-24 16:15:29] ERROR Job error in sharp_get_job_data_len.

[snail01:0:3745196 - context.c:709][2024-09-24 16:15:29] ERROR sharp_get_job_data_len failed: Job error(-35)
[snail01:0:3745196 - context.c:718][2024-09-24 16:15:29] ERROR SHArP Job init error: No resource
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    67108864      16777216     float     sum      -1    13075    5.13    7.70      0    13175    5.09    7.64      0
   134217728      33554432     float     sum      -1    26096    5.14    7.71      0    26128    5.14    7.71      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.68984 

The performance is equivalent to when SHARP is not used.
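
As a sanity check that these numbers match a no-SHARP baseline, the same run can be repeated with CollNet explicitly disabled (same command as above, only NCCL_COLLNET_ENABLE changed):

mpirun -n 2 --host snail01:1,snail02:1 -x NCCL_COLLNET_ENABLE=0 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 2 -b 64M -e 128M -f 2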

Case 3: 4 GPUs (4 nodes × 1 process/node × 1 GPU/process)

This setup works fine, with SHARP enabled:

mpirun -n 4 --host snail01:1,snail02:1,snail03:1,snail04:1 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -t 1 -g 1 -b 64M -e 128M -f 2
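
A workaround I plan to try (untested sketch): launching one process per GPU on the same two nodes, which mirrors the one-GPU-per-process layout that works in Cases 1 and 3:

mpirun -n 4 --host snail01:2,snail02:2 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 1 -b 64M -e 128M -f 2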
