Using COLLNET failed with sharp plugin #1219
Comments
As the log shows: device mlx5_0 port 1 is not valid (port is used by SM)
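To see which local ports might trigger this message, the sketch below walks the InfiniBand sysfs tree and reports ports whose own LID equals the subnet manager LID. It rests on an assumption that the thread does not confirm: that SHARP's "port is used by SM" check corresponds to the SM running on that port, visible as `lid == sm_lid`. The function name `ports_hosting_sm` is hypothetical.

```python
from pathlib import Path


def ports_hosting_sm(sysfs_root="/sys/class/infiniband"):
    """Return (device, port) pairs whose LID equals the subnet manager LID.

    Assumption: a port whose own LID matches sm_lid is the one the SM runs on.
    """
    hits = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return hits
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            try:
                # sysfs reports LIDs as hex strings such as "0x1"
                lid = int((port / "lid").read_text().strip(), 16)
                sm_lid = int((port / "sm_lid").read_text().strip(), 16)
                port_num = int(port.name)
            except (OSError, ValueError):
                continue  # skip ports without readable LID files
            if lid != 0 and lid == sm_lid:
                hits.append((dev.name, port_num))
    return hits


if __name__ == "__main__":
    for dev, port in ports_hosting_sm():
        print(f"{dev} port {port}: the subnet manager appears to run on this port")
```

If a port is reported on a compute node, that node is probably also hosting the SM, which would match the error for mlx5_0 port 1 above.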
Maybe try adding
I tried the env: -x SHARP_ALLOW_SM_PORT=1
[C25L18:0:6928 - context.c:695] INFO job (ID: 1201360872013076329) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
This is the same issue as the one I raised as well. Thank you in advance for your help.
Hi @AddyLaddy,
The error log shows that sharp_coll_comm_init in ncclSharpConnect failed. What does this log mean? I can run sharp_hello successfully.
Can anyone help? Many thanks!
I can run sharp_hello successfully.
[root@C25L18 device]# $HPCX_SHARP_DIR/bin/sharp_hello -d mlx5_0:1 -v 3
[C25L18:0:10352 - context.c:695] INFO job (ID: 1201355100769761442) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[C25L18:0:10352 - context.c:885] INFO sharp_job_id:15 resv_key: tree_type:LLT tree_idx:0 treeID:0x1 caps:0x26 quota:(osts:153 user_data_per_ost:1024 max_groups:153 max_qps:1 max_group_channels:1)
[C25L18:0:10352 - comm.c:397] INFO [group#:0] job_id:15 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x10000000000) mlid:c009
Test Passed.
[root@C25L18 device]#
I hit the same issue during both distributed training and nccl-tests when adding the SHARP parameters:
Issues:
All the nodes can pass sharp_hello tests.
This seems to be the same issue: the error occurs when calling sharp_coll_comm_init in ncclSharpConnect.
I have the same issue. I use the same Docker image to run nccl-tests and a training job. nccl-tests runs normally, but the training job produces this NCCL log:
@RyoYang @shanleo1986 can you run nccl-tests correctly? I can run nccl-tests normally, but the training job fails with an error log.
I unset NCCL_ALGO and get the same error.
Update:
I can run IB SHARP using NCCL_IB_HCA=mlx5_0, but it does not work with multiple IB cards.
Hi, I have run all_reduce_perf using the SHARP plugin with these two parameters:
-x NCCL_COLLNET_ENABLE=1 \
-x NCCL_ALGO=CollNet \
The OpenSM master node runs the sharp_am service:
service sharp_am status
Redirecting to /bin/systemctl status sharp_am.service
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.0.0
Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/sharp_am.service.d
└─Service.conf
Active: active (running) since Fri 2024-03-15 09:51:55 CST; 12min ago
Main PID: 17459 (sharp_am)
Tasks: 40 (limit: 26213)
Memory: 26.3M
CGroup: /system.slice/sharp_am.service
└─17459 /opt/hpc/software/mpi/hpcx/v2.12.0/sharp/bin/sharp_am -O -/opt/hpc/software/mpi/hpcx/v2.12.0/sharp/conf/sharp_am.cfg
From the SHARP log, I can see that SHARP has been initialized successfully.
But there are some error logs when running the test with SHARP:
[C25L19:1:28576 unique id 1201360626793454896] ERROR collect_ports_data: device mlx5_0 port 1 is not valid (port is used by SM)
[C25L19:1:28576 unique id 1201360626793454896] ERROR collect_ports_data: failed to find valid ports
[C25L19:1:28576 unique id 1201360626793454896] ERROR sharp_get_local_data: error retrieving local data for process number 1
[C25L19:1:28576 - context.c:415] ERROR sharp_get_local_data failed: Could not open any IB device(-47)
[C25L18:0:29488 - context.c:433] ERROR OOB Gather failed on comm world, ret:3. rank:0
[C25L18:0:29488 - context.c:635] ERROR empty proceseses data ..
[C25L18:0:29491 - context.c:702] INFO job (ID: 1201360608447302825) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[C25L18:0:29491 - context.c:895] INFO sharp_job_id:28 resv_key: tree_type:LLT tree_idx:0 treeID:0x1 caps:0x26 quota:(osts:23 user_data_per_ost:1024 max_groups:23 max_qps:1 max_group_channels:1)
[C25L18:0:29491 - context.c:899] INFO sharp_job_id:28 tree_type:SAT tree_idx:1 treeID:0x40 caps:0x36
[C25L18:0:29491 - comm.c:403] INFO [group#:0] job_id:28 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[C25L18:0:29491 - comm.c:403] INFO [group#:1] job_id:28 group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
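When several ranks interleave their output like this, it can help to pull out only the ERROR lines and the ports that were rejected. The sketch below is one way to do that. The regular expressions are inferred from the lines in this report rather than from any documented log format, and the names `summarize`, `ERROR_LINE`, and `INVALID_PORT` are hypothetical.

```python
import re

# Prefix looks like "[host:...]" followed by a level; inferred from this report.
ERROR_LINE = re.compile(r"^\[(?P<host>[^:\]]+):[^\]]*\]\s+ERROR\s+(?P<msg>.*)$")
INVALID_PORT = re.compile(
    r"device (?P<dev>\S+) port (?P<port>\d+) is not valid \((?P<reason>[^)]*)\)"
)


def summarize(log_text):
    """Return (errors, invalid_ports) extracted from SHARP log text."""
    errors = []
    invalid_ports = set()
    for line in log_text.splitlines():
        m = ERROR_LINE.match(line.strip())
        if not m:
            continue  # INFO lines and unrelated output are ignored
        errors.append((m["host"], m["msg"]))
        p = INVALID_PORT.search(m["msg"])
        if p:
            invalid_ports.add((m["host"], p["dev"], int(p["port"]), p["reason"]))
    return errors, sorted(invalid_ports)


sample = """\
[C25L19:1:28576 unique id 1201360626793454896] ERROR collect_ports_data: device mlx5_0 port 1 is not valid (port is used by SM)
[C25L19:1:28576 - context.c:415] ERROR sharp_get_local_data failed: Could not open any IB device(-47)
[C25L18:0:29491 - context.c:899] INFO sharp_job_id:28 tree_type:SAT tree_idx:1 treeID:0x40 caps:0x36
"""

errors, invalid_ports = summarize(sample)
for host, dev, port, reason in invalid_ports:
    print(f"{host}: {dev} port {port} rejected ({reason})")
```

On the log above, this points at C25L19 as the node where the port was rejected, while C25L18 only reports the resulting OOB gather failure.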
Can you help me understand why this error occurs and how to resolve it?
Thanks a lot.