Add ROCm support (AMDGPU) #572

Merged: 29 commits merged into JuliaParallel:master on Jun 3, 2022

Conversation

@luraess (Contributor) commented Apr 19, 2022

Add AMDGPU.ROCArray support to the MPI machinery to allow for ROCm-aware MPI device pointer exchange, as suggested by @simonbyrne.
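
In practice this means ROCArray buffers can be passed to MPI calls directly. A minimal sketch of the intended usage, assuming a ROCm-aware MPI build together with this PR (the buffer size and the reduction are illustrative):

using MPI, AMDGPU

MPI.Init()
comm = MPI.COMM_WORLD

# Device buffers go straight into the MPI call; with a ROCm-aware build the
# device pointers are exchanged without staging through host memory.
send = ROCArray(ones(Float64, 8))
recv = similar(send)
MPI.Allreduce!(send, recv, +, comm)

MPI.Finalize()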

⚠️ Note that this PR should replace and close #547 (without merging the latter) as suggested in #547 (comment)

Requires AMDGPU v0.3.5 to work smoothly. (Thanks to @utkinis for helping out as well).

Most MPI.jl tests pass when exporting JULIA_MPI_TEST_ARRAYTYPE=ROCArray, with the exception of:

  • test_reduce.jl
  • test_subarray.jl
  • test_threads.jl

The failing tests are currently commented out and preceded by a # DEBUG comment.
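
For reference, one way to run the suite against ROCArray buffers, assuming test/runtests.jl picks up the JULIA_MPI_TEST_ARRAYTYPE switch mentioned above (the exact invocation is illustrative):

using Pkg
ENV["JULIA_MPI_TEST_ARRAYTYPE"] = "ROCArray"  # select ROCArray buffers in the test suite
Pkg.test("MPI")                               # spawns the mpiexec-based tests with this environment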

The dev docs should also include references to the latest changes.

To-dos

@luraess (Contributor, Author) commented Apr 20, 2022

Here is the updated sandbox repo, https://github.com/luraess/ROCm-MPI, which uses the current PR together with the latest AMDGPU.jl and ImplicitGlobalGrid.jl developments to solve a 2D diffusion equation on multiple AMD GPUs, with or without ROCm-aware MPI support. Tested and running with OpenMPI so far.

@vchuravy (Member) commented:

Can you rebase on master and then add a ROCM section to .pipeline.yml?

@luraess (Contributor, Author) commented May 2, 2022

Added the ROCm tests to Buildkite.

Status (cc @vchuravy), i.e. current issues:

  • Some unit tests related to AMDGPU fail in the GH pipeline because libhsa is not found (see here).
  • The Buildkite tests fail for both CUDA and ROCm: segfaults for CUDA, and The value of the MCA parameter "plm_rsh_agent" was set to a path that could not be found: plm_rsh_agent: ssh : rsh for ROCm.
  • Note that the passing Buildkite status (also on #master) is misleading, as the MPI build passes but the tests fail.

@giordano (Member) commented May 2, 2022

Some unit tests related to AMDGPU fail in the GH pipeline because libhsa is not found (see here)

It's the i686 jobs that are failing, because the library isn't available for that platform: https://github.com/JuliaPackaging/Yggdrasil/blob/b249dffd4df24711384874fda84ffd926f859af9/H/hsa_rocr/build_tarballs.jl#L36-L39

@luraess (Contributor, Author) commented May 2, 2022

because the library isn't available for that platform

Thanks for the insights. What workaround would you suggest?

@giordano (Member) commented May 2, 2022

I have no idea whether the library can be built for that platform, but the name libhsa-runtime64 doesn't suggest so. If it can be built, you should try to add Platform("i686", "linux") on Yggdrasil; if it can't, you'd simply skip the test here.
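
For the latter option, a minimal sketch of such a platform guard in the test file, assuming a word-size check is enough to exclude i686 (the guard actually used may differ):

using Test

if Sys.WORD_SIZE == 32
    @info "Skipping ROCArray tests: libhsa-runtime64 is not available on 32-bit platforms"
else
    # ... run the AMDGPU/ROCArray tests as usual ...
end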

@vchuravy (Member) commented May 2, 2022

@jpsamaroo can we make AMDGPU loadable everywhere?

@vchuravy (Member) commented May 2, 2022

is misleading, as the MPI build passes but the tests fail

Yeah that is intentional while we figure out the bugs.

@luraess (Contributor, Author) commented May 2, 2022

The plm_rsh_agent changes are committed (dc76404). Now there are new errors (related to UCX?). I cannot interpret those that well, to be honest.

@jpsamaroo (Member) commented:

@jpsamaroo can we make AMDGPU loadable everywhere?

That's my intention (it loads on macOS and Windows, even though it's not supported on those OSes). I will look into the failures.

@luraess (Contributor, Author) commented Jun 1, 2022

@jpsamaroo can you open a PR to AMDGPU to provide a less wordy variant of AMDGPU.barrier_and!(queue, AMDGPU.active_kernels(AMDGPU.get_default_queue()))? And we will probably want to have that include the HIP call as well.

Now there is wait(@roc ...). If the proposed variant were to act like synchronize() in CUDA, it could be quite a breaking(?) change to the AMDGPU API, which may have some broader impact. It could be good, but let's not forget consistency...
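
For reference, a sketch of what such a shorthand could look like; it merely wraps the expression quoted above, and the name is hypothetical (this is not an existing AMDGPU.jl API):

# Hypothetical helper wrapping the queue-wide barrier quoted in the comment above.
function roc_synchronize(queue = AMDGPU.get_default_queue())
    AMDGPU.barrier_and!(queue, AMDGPU.active_kernels(queue))
end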

@simonbyrne (Member) commented Jun 1, 2022

I've added AMD back to Buildkite on this PR, but it doesn't look like it is set up correctly. @luraess, any ideas?

@simonbyrne (Member) commented:

The test failure could possibly be openucx/ucx#5485.

@luraess were you able to run the tests locally?

@luraess (Contributor, Author) commented Jun 2, 2022

@simonbyrne The tests are running locally for me (with the exception of test_spawn.jl, which errors due to allocation issues when running things through srun, and test_threads.jl; see below).

I asked the sysadmin if they could share some info about the config of the ROCm-aware OpenMPI build (it's OpenMPI v4.0.6rc4 with ROCm 4.3). Also, one needs Julia 1.7.

Regarding the test failure related to ipc, I had a similar issue when running on non-ROCm-aware MPI and found that adding export PMIX_MCA_psec=native helped.
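
Concretely, that workaround amounts to setting the variable before the test harness spawns mpiexec; setting it from Julia works too, since child processes inherit the environment (shown here for illustration; a plain export in the shell is equivalent):

ENV["PMIX_MCA_psec"] = "native"  # inherited by the mpiexec processes launched by the test suite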

The tests still fail on test_threads.jl:

[ault20.cscs.ch:2085955] PML ucx cannot be selected
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      ault20
  Framework: pml
--------------------------------------------------------------------------
[ault20.cscs.ch:2085974] PML ucx cannot be selected
test_threads.jl: Error During Test at /scratch/lraess/dev/new_MPI/test/runtests.jl:38
  Got exception outside of a @test
  failed process: Process(`mpiexec -n 2 /users/lraess/julia_local/julia-1.7.2/bin/julia -Cnative -J/users/lraess/julia_local/julia-1.7.2/lib/julia/sys.so --depwarn=yes --check-bounds=yes -g1 --startup-file=no /scratch/lraess/dev/new_MPI/test/test_threads.jl`, ProcessExited(1)) [1]

@vchuravy (Member) commented Jun 2, 2022

@luraess what UCX version is your OpenMPI install using?

@luraess (Contributor, Author) commented Jun 2, 2022

what UCX version is your OpenMPI install using?

@vchuravy How do I query that?

@luraess (Contributor, Author) commented Jun 2, 2022

what UCX version is your OpenMPI install using?

@vchuravy Here is the output of ompi_info (see ompi_info.txt) and ucx_info (see ucx_info.txt) on ault.

@simonbyrne (Member) commented Jun 2, 2022

So I think it is failing on this line:

recv_arr = MPI.Reduce(view(send_arr, 2:3), op, MPI.COMM_WORLD; root=root)
if isroot
    @test recv_arr isa ArrayType{T}
    @test recv_arr == sz .* view(send_arr, 2:3)
end

In particular, I think it is doing the conversion to MPIPtr wrong. How are contiguous views handled for ROCArrays?

@@ -1,11 +1,11 @@
 import .AMDGPU

 function Base.cconvert(::Type{MPIPtr}, A::AMDGPU.ROCArray{T}) where T
-    Base.cconvert(Ptr{T}, A.buf.ptr) # returns DeviceBuffer
+    A
@luraess (Contributor, Author) commented:

What was wrong here, @simonbyrne?

@simonbyrne (Member) commented:

A.buf.ptr is a raw pointer, so the object could be cleaned up by GC. It also loses the offset argument required for views.
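
For context, a minimal sketch of the split this implies, assuming pointer(A) on a ROCArray yields the device pointer (an analogous pair of methods would be needed for contiguous SubArrays so their offset is preserved); treat the details as illustrative rather than as the exact merged code:

import .AMDGPU

# cconvert returns the array itself so the ccall keeps it rooted (no early GC).
Base.cconvert(::Type{MPIPtr}, A::AMDGPU.ROCArray) = A

# unsafe_convert produces the raw device pointer only at call time; using
# pointer(A) instead of A.buf.ptr preserves any element offset.
function Base.unsafe_convert(::Type{MPIPtr}, A::AMDGPU.ROCArray{T}) where T
    reinterpret(MPIPtr, pointer(A))
end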

@luraess (Contributor, Author) commented:

Thanks! TIL

@luraess (Contributor, Author) commented Jun 3, 2022

Cool, now the tests pass!

@jpsamaroo (Member) commented:

What's up with the failing 1.7 job?

@simonbyrne (Member) commented:

ERROR: LoadError: LoadError: No GPU agents detected!
Please consider rebuilding AMDGPU

I've been seeing it intermittently. It usually goes away on a restart, though. Is it an issue with the runner?

@simonbyrne requested a review from vchuravy on June 3, 2022 17:40
@jpsamaroo (Member) commented Jun 3, 2022

Ahh, yes, this was an off-by-one bug in the amdgpuci.8 runner config; it should be working correctly now!

@simonbyrne dismissed vchuravy's stale review on June 3, 2022 18:27

Issues resolved

@simonbyrne merged commit a8d4d64 into JuliaParallel:master on Jun 3, 2022
@simonbyrne (Member) commented:

Thank you @luraess!

@vchuravy (Member) commented Jun 3, 2022

Fantastic work all around!
