Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built locally from a tarball on a headnode running RHEL 9.2, repacked as an rpm, shipped to the cluster, and installed on a shared drive for the compute nodes.
Please describe the system on which you are running
Operating system/version: RHEL9.2
Computer hardware: Diskless Cluster, Intel CPUs
Network type: InfiniBand HDR
Scheduler: Slurm 24.05.01
Details of the problem
Starting a Slurm job (a simple MPI ring) on more than one node didn't work, with no error output at all; mpirun only returned exit code 250. So I added some debug/verbosity options to mpirun. Result: PMIx and PRRTE reported no_plugin and internal_error failures during init. Strangely, the mpirun help files (.../share/prte/help....) referenced the prefix directory where I built the package on the headnode, not the prefix where I installed it on the shared drive. (Minor note: the help output points to ".../share/prte" for both sets of help files, even though the PMIx ones live in ".../share/pmix"; that looks like a typo.)
This original prefix is also still reported by pmix_info.
It seems very similar to #9446.
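The debugging I did was along these lines; this is only a sketch, and the MCA parameter names (plm_base_verbose, pmix_base_verbose) as well as the binary name ./mpi_ring are my placeholders, not exact transcripts:

```shell
# Sketch: raise PRRTE/PMIx verbosity during launch (parameter names assumed).
mpirun --prtemca plm_base_verbose 5 \
       --pmixmca pmix_base_verbose 10 \
       -np 2 ./mpi_ring

# Compare the build-time prefix against the relocated install:
ompi_info | grep -i prefix
pmix_info | grep -i prefix
```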
I've done this relocation of Open MPI multiple times for previous versions of Open MPI and Slurm, and I'm aware that OPAL_PREFIX and the library paths need to be exported. I've also tried setting PMIX_PREFIX and PRTE_PREFIX, as well as passing the prefix directly via mpirun --prefix. Nothing has worked so far. As a cross-check, I installed the rpm on two nodes under the original prefix; that worked.
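For reference, my relocation environment looks roughly like the following sketch; the /shared/sw/openmpi-5.0.5 path is a placeholder for the actual shared-drive prefix:

```shell
# Placeholder prefix; substitute the real shared-drive install path.
export OPAL_PREFIX=/shared/sw/openmpi-5.0.5
export PATH="$OPAL_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OPAL_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```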
I've also checked that the rest of the environment variables are set properly.
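The per-node check I mean is along these lines (a sketch; the node count is a placeholder):

```shell
# Sketch: confirm each allocated node sees the relocated environment.
srun -N2 bash -lc 'hostname; echo "OPAL_PREFIX=$OPAL_PREFIX"; command -v mpirun'
```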
Thus, I'm running out of ideas, so I wonder whether this is the same issue as #9446 and requires a fix?
Background information

What version of Open MPI are you using?

v5.0.5

Thanks and cheers
Sebastian