{mpi}[NVHPC/24.1-CUDA-12.4.0] OpenMPI v4.1.6 w/ CUDA 12.3.0 #21566

Open
wants to merge 5 commits into develop

Conversation

tanmoy1989
Contributor

(created using eb --new-pr)

@@ -0,0 +1,72 @@
name = 'OpenMPI'
version = '4.1.6'
versionsuffix = '-CUDAcore-12.4.0'
Contributor

Is there any specific reason for CUDAcore?

Contributor Author

No, I guess that was a typo from my side.

@tanmoy1989
Contributor Author

I have addressed your comments above, thanks for checking!

@tanmoy1989 tanmoy1989 changed the title {mpi}[NVHPC/24.1-CUDA-12.4.0] OpenMPI v4.1.6 w/ CUDAcore 12.4.0 {mpi}[NVHPC/24.1-CUDA-12.4.0] OpenMPI v4.1.6 w/ CUDA 12.4.0 Oct 7, 2024
@tanmoy1989
Contributor Author

@Thyre: Any updates on this, please?

@Thyre
Contributor

Thyre commented Oct 23, 2024

The filename should be OpenMPI-4.1.6-NVHPC-24.1-CUDA-12.4.0.eb. This is why the checks failed for the PR.


Aside from that, the PR looks fine to me. However, I'm not able to approve or merge this PR. In addition, we should at least do some testing; I can hopefully check on my machines on Friday.
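
(For reference, a minimal sketch of the easyconfig header that such a filename implies; EasyBuild derives the filename from name, version, toolchain and versionsuffix, and with an NVHPC-CUDA toolchain the CUDA version is carried by the toolchain itself, so no CUDAcore versionsuffix should be needed. This is an illustration, not the PR's actual file.)

```python
# Sketch only, not the contents of this PR: a header that yields the
# filename OpenMPI-4.1.6-NVHPC-24.1-CUDA-12.4.0.eb
name = 'OpenMPI'
version = '4.1.6'

# the CUDA version is part of the NVHPC toolchain version, so no separate
# versionsuffix (and in particular no CUDAcore suffix) is required
toolchain = {'name': 'NVHPC', 'version': '24.1-CUDA-12.4.0'}
```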

@Thyre
Contributor

Thyre commented Oct 25, 2024

NVHPC/24.1-CUDA-12.4.0 does not exist, but NVHPC/24.1-CUDA-12.3.0 does. The easiest solution is probably to switch to CUDA 12.3.0.

@tanmoy1989 tanmoy1989 changed the title {mpi}[NVHPC/24.1-CUDA-12.4.0] OpenMPI v4.1.6 w/ CUDA 12.4.0 {mpi}[NVHPC/24.1-CUDA-12.4.0] OpenMPI v4.1.6 w/ CUDA 12.3.0 Nov 4, 2024
@tanmoy1989
Contributor Author

I changed it to CUDA 12.3.0. But the corresponding versions of UCC-CUDA and UCX-CUDA are only available for CUDA 12.4.0, so that's an issue. Also, without the versionsuffix = '-CUDAcore-12.4.0' (which I had in my initial .eb file), I am not sure how the '-CUDA-%(cudaver)s' template is going to be resolved for those two dependencies.
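
(As an illustration of how the template is normally resolved: %(cudaver)s is a template that EasyBuild expands from the CUDA version associated with the build, so the CUDA-suffixed dependencies are typically listed along the lines below. The exact versions here are assumptions, not the PR's contents.)

```python
# sketch with assumed versions: CUDA-suffixed dependencies in an OpenMPI easyconfig;
# '%(cudaver)s' is expanded by EasyBuild, so no hard-coded CUDA suffix is needed
dependencies = [
    ('UCX-CUDA', '1.15.0', '-CUDA-%(cudaver)s'),
    ('UCC-CUDA', '1.2.0', '-CUDA-%(cudaver)s'),  # UCC-CUDA version is an assumption
]
```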

@Thyre
Contributor

Thyre commented Nov 5, 2024

> I changed it to CUDA 12.3.0. But the corresponding versions of UCC-CUDA and UCX-CUDA are only available for CUDA 12.4.0, so that's an issue. Also, without the versionsuffix = '-CUDAcore-12.4.0' (which I had in my initial .eb file), I am not sure how the '-CUDA-%(cudaver)s' template is going to be resolved for those two dependencies.

Well that's annoying...
If we want to stick to officially supported CUDA versions, we can only choose CUDA 12.3.0 or CUDA 11.8.0. Since NVHPC uses 12.3.0 already, I would stick with that. The only solution I see then is to add additional versions of UCX-CUDA and UCC-CUDA for this CUDA version, which should be easy to do.
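
(One hedged way such additions could look: a copy of the existing UCX-CUDA 1.15.0 easyconfig with its CUDA dependency switched to 12.3.0, and the same pattern for UCC-CUDA. Only the relevant fragment is sketched here; all other fields would stay as in the existing files.)

```python
# sketch only: in a copied UCX-CUDA-1.15.0 easyconfig, switch the CUDA
# dependency from 12.4.0 to 12.3.0 and keep everything else unchanged
dependencies = [
    ('CUDA', '12.3.0', '', SYSTEM),  # was ('CUDA', '12.4.0', '', SYSTEM)
    ('UCX', '1.15.0'),
]
```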

@tanmoy1989
Contributor Author

tanmoy1989 commented Nov 5, 2024

@Thyre But EB already has UCX-CUDA based on CUDA 12.4.0 (link). I am a bit confused now! Do I need to write a UCX-CUDA-1.15.0 easyconfig based on CUDA 12.3.0, and the same for UCC-CUDA? The same goes for NCCL (which is a dependency of UCC-CUDA), where EB has it with CUDA 12.4.0 (link).
Also, I am getting an error while installing NCCL/CUDA-12.3.0: "error: #error -- unsupported GNU version! gcc versions later than 12 are not supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk." NCCL/CUDA-12.4.0 got built fine, though.
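
(Regarding that nvcc error: the message itself names the '-allow-unsupported-compiler' escape hatch. A possible, untested workaround at the easyconfig level is sketched below, assuming the NCCL build honours prebuildopts; whether overriding the host-compiler check is acceptable here is a separate question.)

```python
# sketch only, not a verified fix: have nvcc skip the host compiler version
# check by appending the flag via its NVCC_APPEND_FLAGS environment variable
prebuildopts = 'export NVCC_APPEND_FLAGS="-allow-unsupported-compiler" && '
```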
