Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built locally from a tarball on a headnode running RHEL 9.2, repacked as an rpm, shipped to the cluster, and installed on a shared drive for the compute nodes.
Please describe the system on which you are running
Operating system/version: RHEL9.2
Computer hardware: Diskless Cluster, Intel CPUs
Network type: InfiniBand HDR
Scheduler: Slurm 24.05.01
Details of the problem
Starting a Slurm job (a simple MPI ring) on more than one node didn't work, with no error output at all; mpirun only returned exit code 250. So I added some debug/verbosity options to mpirun. Result: PMIx and PRRTE reported no_plugin and internal_error failures during init. Strangely, the mpirun help files (.../share/prte/help....) referenced the prefix directory where I built the package on the headnode, not the prefix where I installed it on the shared drive. (Minor note: the help output points to ".../share/prte" for both sets of help files, even though the PMIx ones live in ".../share/pmix"; that looks like a typo.)
This original prefix is also still reported by pmix_info.
It seems very similar to #9446.
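The debugging I did was along these lines; this is only a sketch, and the MCA parameter names (plm_base_verbose, pmix_base_verbose) as well as the binary name ./mpi_ring are my placeholders, not exact transcripts:

```shell
# Sketch: raise PRRTE/PMIx verbosity during launch (parameter names assumed).
mpirun --prtemca plm_base_verbose 5 \
       --pmixmca pmix_base_verbose 10 \
       -np 2 ./mpi_ring

# Compare the build-time prefix against the relocated install:
ompi_info | grep -i prefix
pmix_info | grep -i prefix
```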
I've done this relocation of Open MPI multiple times for previous versions of Open MPI and Slurm, and I'm aware that OPAL_PREFIX and the library paths need to be exported. I've also tried setting PMIX_PREFIX and PRTE_PREFIX, as well as passing the prefix directly via mpirun --prefix. Nothing has worked so far. As a cross-check, I installed the rpm on two nodes under the original prefix; that worked.
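For reference, my relocation environment looks roughly like the following sketch; the /shared/sw/openmpi-5.0.5 path is a placeholder for the actual shared-drive prefix:

```shell
# Placeholder prefix; substitute the real shared-drive install path.
export OPAL_PREFIX=/shared/sw/openmpi-5.0.5
export PATH="$OPAL_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OPAL_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```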
I've also checked that the rest of the environment variables are set properly.
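The per-node check I mean is along these lines (a sketch; the node count is a placeholder):

```shell
# Sketch: confirm each allocated node sees the relocated environment.
srun -N2 bash -lc 'hostname; echo "OPAL_PREFIX=$OPAL_PREFIX"; command -v mpirun'
```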
Thus, I'm running out of ideas, so I wonder whether this is the same issue as #9446 and requires a fix?
Background information

What version of Open MPI are you using?

v5.0.5

Thanks and cheers
Sebastian