Add ROCm support (AMDGPU) #572
Conversation
Here https://github.com/luraess/ROCm-MPI is the updated sandbox repo using the current PR together with the latest AMDGPU.jl and ImplicitGlobalGrid.jl developments to solve a 2D diffusion equation on multiple AMD GPUs, with or without ROCm-aware MPI support. Tested and running with OpenMPI so far. |
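For readers landing here, a minimal sketch of the kind of exchange this PR enables: passing ROCArray buffers straight to MPI calls. This is not code from the sandbox repo; it assumes a ROCm-aware MPI build plus this PR's ROCArray support, and the positional MPI.jl signatures shown may differ slightly between MPI.jl releases.

```julia
# Minimal sketch (illustrative, not from the sandbox repo): a ring exchange of
# ROCArray buffers passed directly to MPI, assuming a ROCm-aware MPI library
# and the ROCArray conversion added in this PR.
using MPI, AMDGPU

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)
dst   = mod(rank + 1, nproc)
src   = mod(rank - 1, nproc)

send = ROCArray(fill(Float64(rank), 1024))  # device buffer
recv = similar(send)                        # device buffer

req = MPI.Irecv!(recv, src, 0, comm)  # receive directly into device memory
MPI.Send(send, dst, 0, comm)          # send directly from device memory
MPI.Wait!(req)

MPI.Finalize()
```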
Can you rebase on master and then add a ROCm section to .pipeline.yml? |
Add ROCm test to Buildkite. Status (cc @vchuravy), i.e. current issues: |
It's the i686 jobs that are failing, because the library isn't available for that platform: https://github.com/JuliaPackaging/Yggdrasil/blob/b249dffd4df24711384874fda84ffd926f859af9/H/hsa_rocr/build_tarballs.jl#L36-L39 |
Thanks for the insights. What workaround could you suggest? |
I have no idea whether the library can be built for that platform, but the name |
@jpsamaroo can we make AMDGPU loadable everywhere? |
Yeah that is intentional while we figure out the bugs. |
The |
That's my intention (it loads on macOS and Windows, even though it's not supported on those OSes). Will look into the failures. |
Now there is |
I've added AMD back to buildkite on this PR, but it doesn't look like it is set up correctly. @luraess any ideas? |
Test failure could possibly be openucx/ucx#5485 @luraess were you able to run the tests locally? |
@simonbyrne The tests are running locally for me (with the exception of …). I asked the sysadmin if they could share some info about the config of the ROCm-aware OpenMPI build (it's OpenMPI v4.0.6rc4 with ROCm 4.3). Also, one needs Julia 1.7. Regarding the test failure related to IPC, I had a similar issue when running on non-ROCm-aware MPI and found that adding … The test still fails on … |
@luraess what UCX version is your OpenMPI install using? |
@vchuravy How to query? |
@vchuravy Here is the output of … |
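For anyone else needing to check this, two common ways to see which UCX an OpenMPI build uses (illustrative commands; output and availability depend on the install):

```sh
# Show UCX-related components OpenMPI was built with
ompi_info | grep -i ucx

# If the UCX tools are installed, print the UCX library version directly
ucx_info -v
```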
So I think it is failing on this line: Lines 93 to 97 in 5fd4180
In particular, I think it is doing the conversion to MPIPtr wrong. How are contiguous views handled for ROCArrays? |
```diff
@@ -1,11 +1,11 @@
 import .AMDGPU

 function Base.cconvert(::Type{MPIPtr}, A::AMDGPU.ROCArray{T}) where T
-    Base.cconvert(Ptr{T}, A.buf.ptr) # returns DeviceBuffer
+    A
```
What was wrong here, @simonbyrne?
`A.buf.ptr` is a raw pointer, so the object could be cleaned up by the GC. It also loses the offset argument required for views.
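In other words, the safe approach is the usual two-step conversion: `cconvert` returns the array itself (keeping it rooted for the duration of the underlying `ccall`), and `unsafe_convert` produces the pointer including any view offset. A rough sketch of that pattern follows; it is not necessarily the exact code in this PR, and it assumes `pointer(::ROCArray)` exists and is offset-aware, analogous to `pointer(::CuArray)`.

```julia
# Sketch of the GC-safe conversion pattern (illustrative; assumes an
# offset-aware pointer(::ROCArray), analogous to pointer(::CuArray)).
import AMDGPU
import MPI: MPIPtr

# cconvert returns the array itself, so the ROCArray and its buffer stay
# rooted until the surrounding ccall completes.
Base.cconvert(::Type{MPIPtr}, A::AMDGPU.ROCArray) = A

# unsafe_convert then takes the device pointer; if pointer(A) already includes
# the element offset, contiguous views convert correctly as well.
function Base.unsafe_convert(::Type{MPIPtr}, A::AMDGPU.ROCArray{T}) where T
    reinterpret(MPIPtr, pointer(A))
end
```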
Thanks! TIL
Cool, now the tests pass! |
What's up with the failing 1.7 job? |
I've been seeing it intermittently. It usually goes away on a restart, though. Is it an issue with the runner? |
Ahh, yes, this was an off-by-one bug in the amdgpuci.8 runner config; it should be working correctly now! |
Thank you @luraess! |
Fantastic work all around! |
Add AMDGPU.ROCArray support to the MPI machinery to allow for ROCm-aware MPI device pointer exchange, as suggested by @simonbyrne. Requires AMDGPU v0.3.5 to work smoothly. (Thanks to @utkinis for helping out as well.)

Most MPI.jl tests pass when exporting `JULIA_MPI_TEST_ARRAYTYPE=ROCArray`, with the exception of:
- test_reduce.jl
- test_subarray.jl
- test_threads.jl
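For reference, a typical way to run the suite with that setting (illustrative invocation; adjust to your project layout):

```sh
# Run the MPI.jl test suite against ROCArray buffers
export JULIA_MPI_TEST_ARRAYTYPE=ROCArray
julia --project -e 'using Pkg; Pkg.test("MPI")'
```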
Failing tests are currently commented out and preceded by a `# DEBUG` comment. The dev doc should also include references to the latest changes.

To-dos:
- `LocalPreferences.toml` outside of MPI.jl (#570)
- `CUDA.precompile_runtime()` equivalent for AMDGPU. The feature would first need to be added to AMDGPU.jl (note that all works fine without it for now).
- `usage.md` in the doc as well?
- `.buildkite/pipeline.yml` (I am waiting to hear back from the sysadmin that built the ROCm-aware OpenMPI on the test server regarding the config); a possible shape of that entry is sketched below.
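A rough sketch of what a ROCm entry in `.buildkite/pipeline.yml` could look like, modeled on the existing CUDA jobs. The queue name, agent tag, plugin version, and timeout are assumptions and would need to match the actual CI setup:

```yaml
# Hypothetical ROCm job (agent tags, queue, and plugin version are assumptions)
steps:
  - label: "Julia 1.7 -- AMDGPU"
    plugins:
      - JuliaCI/julia#v1:
          version: "1.7"
    agents:
      queue: "juliagpu"
      rocm: "*"            # request an agent with an AMD GPU
    env:
      JULIA_MPI_TEST_ARRAYTYPE: "ROCArray"
    command: julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.test("MPI")'
    timeout_in_minutes: 60
```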