
Add ROCm (AMDGPU) support #547

Closed
wants to merge 10 commits into from

Conversation

luraess
Contributor

@luraess luraess commented Apr 11, 2022

Add AMDGPU.ROCArray support to the MPI machinery to allow for ROCm-aware MPI device pointer exchange, as suggested by @simonbyrne. Requires AMDGPU v0.3.4 to work smoothly. (Thanks to @utkinis for helping out as well)

All MPI.jl tests pass when exporting JULIA_MPI_TEST_ARRAYTYPE=ROCArray, with the exception of:

  • one test in test_reduce.jl, one test in test_subarray.jl, and the test_threads.jl test (the latter returning a UCX error);
  • the test_Ibarrier.jl test produces the following warning:
    tag_match.c:61   UCX  WARN  unexpected tag-receive descriptor [...] was not matched
    
  • the test_sendrecv.jl test warns about
    ┌ Warning: Assignment to `done` in soft scope is ambiguous because a global variable by the same name exists: `done` will be treated as a new local. 
    Disambiguate by using `local done` to suppress this warning or `global done` to assign to the existing global variable.
    └ @ /scratch/lraess/dev/MPI.jl/test/test_sendrecv.jl:75
    
  • Failing tests (or failing parts thereof) are currently commented out and preceded by a # DEBUG comment.
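For reference, selecting the array type for the test suite via the environment variable mentioned above might look like this (a sketch; the exact Pkg invocation may differ in your setup, and actually running the suite requires a ROCm-aware MPI):

```shell
# Select the array type used by the MPI.jl test suite (the suite reads
# JULIA_MPI_TEST_ARRAYTYPE; Array is the default).
export JULIA_MPI_TEST_ARRAYTYPE=ROCArray
echo "testing with array type: $JULIA_MPI_TEST_ARRAYTYPE"
# julia --project -e 'using Pkg; Pkg.test("MPI")'   # requires ROCm-aware MPI + AMD GPU
```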

MWE

A sandbox repo using the suggested additions to MPI.jl, combined with AMDGPU.jl within ImplicitGlobalGrid.jl, can be found at: https://github.com/luraess/ROCm-MPI. Both the ROCm-aware and non-ROCm-aware implementations of AMDGPU.jl interacting with MPI.jl (within my ImplicitGlobalGrid dev fork) report success. The ROCm-aware stack was tested on CSCS's ault machine with the following stack:

  • Open MPI v4.0.6rc4
  • ROCm 4.2
  • Julia 1.7.2
  • 2× Vega 20 (16 GB, PCIe) GPUs (gfx906)

To be fixed before merge

Before merging the changes, one would need to search for # DEBUG in the test scripts and fix the remaining issues. Also, I updated the docs but realised that the docs in the v0.19.2 tag I forked from (to add the suggested changes) are not up to date with respect to the latest version. How should I proceed?

@luraess
Contributor Author

luraess commented Apr 12, 2022

What would be the way to proceed in order to include the latest docs in this PR, adding the relevant info for the ROCm stack there? The docs for the v0.19.2 tag are not the latest. Thanks!

@simonbyrne
Member

It looks like this branch has diverged quite a bit from master. Can you bring it up to date?

@luraess
Contributor Author

luraess commented Apr 12, 2022

The master branch is quite different from v0.19.2 and is currently failing, while the latest release in the registry (i.e. v0.19.2) works fine (which is why we added ROCm support on top of v0.19.2). I am thus uncertain how to proceed. Could you please elaborate on "Can you bring it up to date?"? Thanks a lot!

@simonbyrne
Member

If you just want to add support to 0.19, then the easiest option is to create a PR against the release-0.19 branch.

If you want the changes in master, then we need to figure out whatever the issues are.

@luraess
Contributor Author

luraess commented Apr 12, 2022

Which would be most suitable? I.e., what is the timeline for "releasing" master? Not that many changes were needed for ROCm support, so either one, or both, would be fine. Also, where can I get the latest docs to add the new bits to?

@vchuravy vchuravy changed the base branch from master to release-0.19 April 13, 2022 01:50
@vchuravy
Member

I changed the base of this PR to the appropriate target to make reviewing feasible. We will need a second PR in due time created with:

git checkout lr/rocmaware
git checkout -b lr/rocmaware-0.20
git rebase -i origin/master

Normally the order of operations would be the other way around, e.g. first create a PR against master and then later rebase/backport it onto a release branch.

if provided == MPI.THREAD_MULTIPLE
send_arr = collect(1.0:N)
recv_arr = zeros(N)
# DEBUG: currently failing on UCX error
Member

Can you mask it out with if ArrayType != Array?
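The suggested guard might look roughly like the following in the test file (a hypothetical sketch; `ArrayType` is assumed to be the alias the test harness derives from JULIA_MPI_TEST_ARRAYTYPE, and the threaded send/receive body is elided):

```
# Sketch: only run the THREAD_MULTIPLE test for plain CPU arrays,
# skipping it for GPU array types such as ROCArray.
if ArrayType == Array && provided == MPI.THREAD_MULTIPLE
    send_arr = collect(1.0:N)
    recv_arr = zeros(N)
    # ... spawn threaded Isend/Irecv as before ...
end
```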

Contributor Author

I can mask it out by wrapping the if provided == MPI.THREAD_MULTIPLE [...] end block into the suggested if. But then the test still executes for ArrayType == ROCArray, which leads to a segfault (see error_rocarray.txt).

Also, if using ArrayType==Array for testing, then the test errors as well (see error_array.txt).

Contributor Author

@luraess luraess Apr 14, 2022

@simonbyrne Does the error stack tell us what to search for? It seems the failure is not related to the ROCm additions, since it also fails with Arrays. Happy to work through the steps needed to be able to merge the ROCm additions 🙂

@luraess
Contributor Author

luraess commented Apr 13, 2022

Thanks @vchuravy for rebasing.

Normally the order of operations would be the other way around, e.g. first create a PR against master and then later rebase/backport it onto a release branch.

Now I know for the next time - sorry for the mess.

@vchuravy
Member

One thing we should do is to set up CI for ROCm and CUDA on the JuliaGPU Buildkite. I can try to look into that.

@luraess
Contributor Author

luraess commented Apr 14, 2022

One thing we should do is to set up CI for ROCm and CUDA on the JuliaGPU Buildkite. I can try to look into that.

That'd be great. Let me know if you need any input as well.

@vchuravy vchuravy added this to the 0.20 milestone Apr 15, 2022
@vchuravy
Member

Can you rebase onto master? And add the correct definition to .buildkite/pipeline.yml?
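A ROCm entry in .buildkite/pipeline.yml might look roughly like the following (a hypothetical sketch: the queue name, agent tags, and plugin version are assumptions based on typical JuliaGPU Buildkite setups, not the project's actual configuration):

```yaml
steps:
  - label: "Julia 1.7 -- ROCm"
    plugins:
      - JuliaCI/julia#v1:
          version: "1.7"
    agents:
      queue: "juliagpu"      # assumption: JuliaGPU agent queue
      rocm: "*"              # assumption: tag selecting a ROCm-capable agent
    env:
      JULIA_MPI_TEST_ARRAYTYPE: "ROCArray"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
```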

@vchuravy vchuravy added the rocm label Apr 18, 2022
@luraess
Contributor Author

luraess commented Apr 18, 2022

Sure, I'll do so @vchuravy.

@luraess
Contributor Author

luraess commented Apr 19, 2022

#572 should close this PR without merging it as it addresses #547 (comment)

@vchuravy vchuravy closed this Apr 19, 2022
@luraess luraess deleted the lr/rocmaware branch April 19, 2022 10:34