Add ROCm support (AMDGPU) #572

Merged: 29 commits merged into JuliaParallel:master on Jun 3, 2022

Conversation

@luraess (Contributor) commented Apr 19, 2022

Add AMDGPU.ROCArray support to the MPI machinery to allow for ROCm-aware MPI device pointer exchange, as suggested by @simonbyrne.
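
In practice this means ROCArray buffers can be passed to MPI calls directly. A minimal sketch of the intended usage, assuming a ROCm-aware MPI build together with this PR (the buffer size and the reduction are illustrative):

using MPI, AMDGPU

MPI.Init()
comm = MPI.COMM_WORLD

# Device buffers go straight into the MPI call; with a ROCm-aware build the
# device pointers are exchanged without staging through host memory.
send = ROCArray(ones(Float64, 8))
recv = similar(send)
MPI.Allreduce!(send, recv, +, comm)

MPI.Finalize()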

⚠️ Note that this PR should replace and close #547 (without merging the latter) as suggested in #547 (comment)

Requires AMDGPU v0.3.5 to work smoothly. (Thanks to @utkinis for helping out as well).

Most MPI.jl tests pass when exporting JULIA_MPI_TEST_ARRAYTYPE=ROCArray, with the exception of:

  • test_reduce.jl
  • test_subarray.jl
  • test_threads.jl

The failing tests are currently commented out and preceded by a # DEBUG comment.
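
For reference, one way to run the suite against ROCArray buffers, assuming test/runtests.jl picks up the JULIA_MPI_TEST_ARRAYTYPE switch mentioned above (the exact invocation is illustrative):

using Pkg
ENV["JULIA_MPI_TEST_ARRAYTYPE"] = "ROCArray"  # select ROCArray buffers in the test suite
Pkg.test("MPI")                               # spawns the mpiexec-based tests with this environment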

The dev docs should also include references to the latest changes.

To-dos

@luraess (Contributor, Author) commented Apr 20, 2022

Here is the updated sandbox repo, https://github.com/luraess/ROCm-MPI, which uses the current PR together with the latest AMDGPU.jl and ImplicitGlobalGrid.jl developments to solve a 2D diffusion equation on multiple AMD GPUs, with or without ROCm-aware MPI support. Tested and running with OpenMPI so far.

@vchuravy (Member) commented:

Can you rebase on master and then add a ROCM section to .pipeline.yml?

@luraess (Contributor, Author) commented May 2, 2022

Added the ROCm tests to Buildkite.

Status (cc @vchuravy), i.e. current issues:

  • Some unit tests related to AMDGPU fail in the GH pipeline because libhsa is not found (see here).
  • The Buildkite tests fail for both CUDA and ROCm: segfaults for CUDA, and The value of the MCA parameter "plm_rsh_agent" was set to a path that could not be found: plm_rsh_agent: ssh : rsh for ROCm.
  • Note that the passing Buildkite status (also on #master) is misleading, as the MPI build passes but the tests fail.

@giordano (Member) commented May 2, 2022

Some unit tests related to AMDGPU fail in the GH pipeline because libhsa is not found (see here)

It's the i686 jobs that are failing, because the library isn't available for that platform: https://github.com/JuliaPackaging/Yggdrasil/blob/b249dffd4df24711384874fda84ffd926f859af9/H/hsa_rocr/build_tarballs.jl#L36-L39

@luraess (Contributor, Author) commented May 2, 2022

because the library isn't available for that platform

Thanks for the insights. What workaround would you suggest?

@giordano (Member) commented May 2, 2022

I have no idea whether the library can be built for that platform, but the name libhsa-runtime64 doesn't suggest so. If it can be built, you should try to add Platform("i686", "linux") on Yggdrasil; if it can't, you'd simply skip the test here.
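
For the latter option, a minimal sketch of such a platform guard in the test file, assuming a word-size check is enough to exclude i686 (the guard actually used may differ):

using Test

if Sys.WORD_SIZE == 32
    @info "Skipping ROCArray tests: libhsa-runtime64 is not available on 32-bit platforms"
else
    # ... run the AMDGPU/ROCArray tests as usual ...
end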

@vchuravy (Member) commented May 2, 2022

@jpsamaroo can we make AMDGPU loadable everywhere?

@vchuravy (Member) commented May 2, 2022

is misleading, as the MPI build passes but the tests fail

Yeah that is intentional while we figure out the bugs.

@luraess (Contributor, Author) commented May 2, 2022

The plm_rsh_agent changes are committed (dc76404). Now there are new errors (related to UCX?). I cannot interpret those that well, to be honest.

@jpsamaroo (Member) commented:

@jpsamaroo can we make AMDGPU loadable everywhere?

That's my intention (it loads on macOS and Windows, even though it's not supported on those OSes). I will look into the failures.

@luraess (Contributor, Author) commented Jun 1, 2022

@jpsamaroo can you open a PR to AMDGPU to provide a less wordy variant of AMDGPU.barrier_and!(queue, AMDGPU.active_kernels(AMDGPU.get_default_queue()))? And we will probably want to have that include the HIP call as well.

Now there is wait(@roc ...). If the proposed variant were to act like synchronize() in CUDA, it could be quite a breaking(?) change to the AMDGPU API, which may have some broader impact. It could be good, but let's not forget consistency...
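
For reference, a sketch of what such a shorthand could look like; it merely wraps the expression quoted above, and the name is hypothetical (this is not an existing AMDGPU.jl API):

# Hypothetical helper wrapping the queue-wide barrier quoted in the comment above.
function roc_synchronize(queue = AMDGPU.get_default_queue())
    AMDGPU.barrier_and!(queue, AMDGPU.active_kernels(queue))
end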

@simonbyrne (Member) commented Jun 1, 2022

I've added AMD back to Buildkite on this PR, but it doesn't look like it is set up correctly. @luraess, any ideas?

@simonbyrne (Member) commented:

The test failure could possibly be openucx/ucx#5485.

@luraess were you able to run the tests locally?

@luraess (Contributor, Author) commented Jun 2, 2022

@simonbyrne The tests are running locally for me (with the exception of test_spawn.jl, which errors due to allocation issues when running things through srun, and test_threads.jl; see below).

I asked the sysadmin if they could share some info about the config of the ROCm-aware OpenMPI build (it's OpenMPI v4.0.6rc4 with ROCm 4.3). Also, one needs Julia 1.7.

Regarding the test failure related to ipc, I had a similar issue when running on non-ROCm-aware MPI and found that adding export PMIX_MCA_psec=native helped.
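
Concretely, that workaround amounts to setting the variable before the test harness spawns mpiexec; setting it from Julia works too, since child processes inherit the environment (shown here for illustration; a plain export in the shell is equivalent):

ENV["PMIX_MCA_psec"] = "native"  # inherited by the mpiexec processes launched by the test suite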

The tests still fail on test_threads.jl:

[ault20.cscs.ch:2085955] PML ucx cannot be selected
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      ault20
  Framework: pml
--------------------------------------------------------------------------
[ault20.cscs.ch:2085974] PML ucx cannot be selected
test_threads.jl: Error During Test at /scratch/lraess/dev/new_MPI/test/runtests.jl:38
  Got exception outside of a @test
  failed process: Process(`mpiexec -n 2 /users/lraess/julia_local/julia-1.7.2/bin/julia -Cnative -J/users/lraess/julia_local/julia-1.7.2/lib/julia/sys.so --depwarn=yes --check-bounds=yes -g1 --startup-file=no /scratch/lraess/dev/new_MPI/test/test_threads.jl`, ProcessExited(1)) [1]

@vchuravy (Member) commented Jun 2, 2022

@luraess what UCX version is your OpenMPI install using?

@luraess (Contributor, Author) commented Jun 2, 2022

what UCX version is your OpenMPI install using?

@vchuravy How do I query that?

@luraess (Contributor, Author) commented Jun 2, 2022

what UCX version is your OpenMPI install using?

@vchuravy Here is the output of ompi_info (see ompi_info.txt) and ucx_info (see ucx_info.txt) on ault.

@simonbyrne (Member) commented Jun 2, 2022

So I think it is failing on this line:

recv_arr = MPI.Reduce(view(send_arr, 2:3), op, MPI.COMM_WORLD; root=root)
if isroot
    @test recv_arr isa ArrayType{T}
    @test recv_arr == sz .* view(send_arr, 2:3)
end

In particular, I think it is doing the conversion to MPIPtr wrong. How are contiguous views handled for ROCArrays?

@@ -1,11 +1,11 @@
 import .AMDGPU

 function Base.cconvert(::Type{MPIPtr}, A::AMDGPU.ROCArray{T}) where T
-    Base.cconvert(Ptr{T}, A.buf.ptr) # returns DeviceBuffer
+    A
@luraess (Contributor, Author) commented:

What was wrong here, @simonbyrne?

@simonbyrne (Member) commented:

A.buf.ptr is a raw pointer, so the object could be cleaned up by GC. It also loses the offset argument required for views.
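
For context, a minimal sketch of the split this implies, assuming pointer(A) on a ROCArray yields the device pointer (an analogous pair of methods would be needed for contiguous SubArrays so their offset is preserved); treat the details as illustrative rather than as the exact merged code:

import .AMDGPU

# cconvert returns the array itself so the ccall keeps it rooted (no early GC).
Base.cconvert(::Type{MPIPtr}, A::AMDGPU.ROCArray) = A

# unsafe_convert produces the raw device pointer only at call time; using
# pointer(A) instead of A.buf.ptr preserves any element offset.
function Base.unsafe_convert(::Type{MPIPtr}, A::AMDGPU.ROCArray{T}) where T
    reinterpret(MPIPtr, pointer(A))
end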

@luraess (Contributor, Author) commented:

Thanks! TIL

@luraess (Contributor, Author) commented Jun 3, 2022

Cool, now the tests pass!

@jpsamaroo (Member) commented:

What's up with the failing 1.7 job?

@simonbyrne (Member) commented:

ERROR: LoadError: LoadError: No GPU agents detected!
Please consider rebuilding AMDGPU

I've been seeing it intermittently. It usually goes away on a restart, though. Is it an issue with the runner?

@simonbyrne requested a review from vchuravy on June 3, 2022 17:40
@jpsamaroo (Member) commented Jun 3, 2022

Ahh, yes, this was an off-by-one bug in the amdgpuci.8 runner config; it should be working correctly now!

@simonbyrne dismissed vchuravy's stale review on June 3, 2022 18:27

Issues resolved

@simonbyrne merged commit a8d4d64 into JuliaParallel:master on Jun 3, 2022
@simonbyrne (Member) commented:

Thank you @luraess!

@vchuravy (Member) commented Jun 3, 2022

Fantastic work all around!
