-
-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ROCm (AMDGPU) support #547
Conversation
What would be the way to proceed in order to include the latest doc to be able to this PR, adding in there the relevant infos for the ROCm stack? The doc relative to 0.19.2 Tag is not the latest. Thanks |
It looks like this branch has diverged quite a bit from master. Can you bring it up to date? |
Master branch is quite different from v0.19.2 and was currently failing while "latest" on registry (i.e. v0.19.2) works fine (reason why we added to ROCm support to v0.19.2). I am thus uncertain how to proceed. Could you please elaborate on "Can you bring it up to date?". Thanks a lot! |
If you just want to add support to 0.19, then the easiest option is to create a PR against the If you want the changes in |
What would be most suited? I.e., what's the timeline of "releasing" master? Not that many changes were done for ROCm support so either or, or both could be fine. Also, where to get the latest doc to add the new bits into? |
I changed the base of this PR to the appropriate target to make reviewing feasible. We will need a second PR in due time created with:
Normally the order of operations would be to other way around, e.g. first create a PR against master and then later rebase/backport it to a release branch. |
if provided == MPI.THREAD_MULTIPLE | ||
send_arr = collect(1.0:N) | ||
recv_arr = zeros(N) | ||
# DEBUG: currently failing on UCX error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you mask it out by if ArrayType != Array
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can mask it out, wrapping the if provided == MPI.THREAD_MULTIPLE [...] end
into the suggested if
. But then the test still execute for ArrayType==ROCArray
which leads to a segfault (see error_rocarray.txt).
Also, if using ArrayType==Array
for testing, then the test errors as well (see error_array.txt).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonbyrne Does the error stack tell what to search for? Seems the failure is not related to ROCm addition since it fails also with Array
s. Happy to work through the steps needed to be able to merge the ROCm additions 🙂
Thanks @vchuravy for rebasing.
Now I know for the next time - sorry for the mess. |
One thing we should do is to setup CI for ROCM and CUDA, on the JuliaGPU buildkite. I can try to look into that. |
That'd be great. Let me know if you need any input as well. |
Can you rebase onto master? And add the correct definition to |
Sure, I'll do so @vchuravy. |
#572 should close this PR without merging it as it addresses #547 (comment) |
Add
AMDGPU.ROCArray
support to the MPI machinery to allow for ROCm-aware MPI device pointer exchange, as suggested by @simonbyrne. Requires AMDGPU v0.3.4 to work smoothly. (Thanks to @utkinis for helping out as well)All MPI.jl tests pass exporting
JULIA_MPI_TEST_ARRAYTYPE=ROCArray
with exception of:test_reduce.jl
, one test intest_subarray.jl
and thetest_threads.jl
test (the latter returning a UCX error);test_Ibarrier.jl
produce following warningtest_sendrecv.jl
warn about# DEBUG
comment.MWE
Sandbox repo using the suggested addition to MPI.jl combined to AMDGPU.jl within ImpliciGlobalGrid.jl can be found at: https://github.com/luraess/ROCm-MPI. Both ROCm-aware and not implementations of AMDGPU.jl interacting with MPI.jl (within my ImplicitGlobalGrid dev fork) report success. The ROCm-aware stack was tested on CSCS's
ault
machine on following stack:To be fixed before merge
Before merging the changes, one would need to search for
# DEBUG
in the test scripts and fix issues. Also, I updated the doc but realised that the doc in the v0.19.2 tag I forked to add the suggested changes is not up to date wrt latest. How to proceed?