UCX 1.10.1 sometimes fails to find mlx5_core for some MPI tasks on single node runs #8704
Comments
@akesandgren, can you please attach the mentioned log files?
Exactly which ones do you mean?
In the original description you mentioned the following:
Do you have those logs available?
The config output, yes; not the other one. I forgot to remove those lines from the original description. The interesting part here is how two different tasks on the same node can end up with different outcomes on whether UCX is chosen or not. How it was actually built shouldn't really matter; it's a standard build with:
Ok, I see. Can you please upload the outputs of both the bad and the good runs? Also, please set UCX_LOG_LEVEL=debug when running the apps (even the release UCX version contains some debug traces).
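For readers reproducing this, one way to pass the variable to every rank is Open MPI's `-x` option on `mpirun`. The sketch below only assembles and prints such a command line. The application path `./my_app`, the rank count, and the helper name `build_debug_command` are placeholders, not values taken from this issue.

```python
import shlex


def build_debug_command(app, nprocs):
    """Build an mpirun argv that exports UCX_LOG_LEVEL=debug to all ranks.

    `app` and `nprocs` are placeholders chosen by the caller; they are not
    taken from the issue report.
    """
    return [
        "mpirun",
        "-np", str(nprocs),
        "-x", "UCX_LOG_LEVEL=debug",  # -x exports one variable per option
        app,
    ]


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(build_debug_command("./my_app", 4)))
```

Redirecting the output of the good and the bad run to separate files makes them easier to compare afterwards.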
Here's a tar file with a correct and a failed output. Both cases ran on the same node.
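When comparing a good and a bad debug log like these, timestamps and process IDs differ on every line and drown out the real differences. A minimal sketch of one way to filter them out follows, assuming each log line starts with a `[timestamp] [host:pid:tid]` prefix. That prefix shape is an assumption, and the sample lines are made up for illustration, not taken from the attached logs.

```python
import difflib
import re

# Assumed prefix shape: "[timestamp] [host:pid:tid] ". Adjust the pattern
# if the actual log lines look different.
_PREFIX = re.compile(r"^\[\d+\.\d+\]\s+\[[^\]]*\]\s*")


def normalize(line):
    """Drop the per-process prefix so only the message text is compared."""
    return _PREFIX.sub("", line.rstrip("\n"))


def diff_logs(good_lines, bad_lines):
    """Return unified-diff lines between two normalized logs."""
    good = [normalize(line) for line in good_lines]
    bad = [normalize(line) for line in bad_lines]
    return list(difflib.unified_diff(good, bad, "good", "bad", lineterm=""))


# Made-up sample lines, purely illustrative (not real UCX output).
good = ["[1.000001] [node:100:0] step A ok", "[1.000002] [node:100:0] step B ok"]
bad = ["[2.000001] [node:200:0] step A ok", "[2.000002] [node:200:0] step B failed"]
for line in diff_logs(good, bad):
    print(line)
```

For real files, pass `open(path).readlines()` for each log in place of the sample lists.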
Thanks for the logs. The issue seems to be similar to #8511. The following errors appear when the job fails:
You are not using containers, right?
No, not using containers. From the node I've been running on:
UCX_POSIX_USE_PROC_LINK=n does not fix the problem. UCX 1.11.2 does not seem to have this problem; at least I've been unable to trigger it (still using OpenMPI 4.1.1).
A similar issue was fixed by open-mpi/ompi#9505, which is part of OpenMPI 4.1.2 and above.
Describe the bug
Sometimes, but not always, when running OpenMPI 4.1.1 with UCX 1.10.1, the pml ucx component fails to find mlx5_core on some of the tasks in a single node run.
Steps to Reproduce
Setup and versions
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX
What happens is that, for a single node job, the diff between two tasks for the pml ucx component is this:
Has this been fixed in later versions, and if so, which commit(s) are involved?