
parallel distributed lazy for big(er) meshes #599

Open · anderson2981 opened this issue Feb 1, 2022 · 12 comments

@anderson2981 (Contributor)

Parallel distributed lazy runs out of memory during compilation on a 3M-element, 3D, second-order mesh.

Node counts of 1, 2, 4, and 8 were tried on Lassen, with 4 ranks (GPUs) used per node.

This mesh will run on 4 nodes in eager mode.
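
For context, a minimal sketch of what "eager" vs. "lazy" means in this setup, assuming the stock arraycontext classes rather than whatever subclasses the driver actually uses (illustrative only):

```python
# Illustrative only -- not the driver's actual setup code.
import pyopencl as cl
import pyopencl.tools as cl_tools
from arraycontext import PyOpenCLArrayContext, PytatoPyOpenCLArrayContext

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
alloc = cl_tools.MemoryPool(cl_tools.ImmediateAllocator(queue))

# Eager: each array operation runs immediately as its own small kernel.
eager_actx = PyOpenCLArrayContext(queue, allocator=alloc)

# Lazy: operations build a pytato DAG that is compiled per rank into one
# large loopy/OpenCL program; that compiled program is what exhausts
# GPU resources here.
lazy_actx = PytatoPyOpenCLArrayContext(queue, allocator=alloc)
```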

The data sets can be found in the distributed-lazy branch of the isolator driver, in the test_mesh subdirectory. halfX is linked to the 3M-element mesh; eigthX and quarterX are smaller meshes which run successfully. There is also a 21M-element mesh located in the oneX directory.

The meshes are not checked into the repo, but can be generated with the make_mesh.sh scripts located in data/3D/. These scripts require gmsh 4.9.3 to be installed.
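
As a quick sanity check after generating a mesh, something like the following reports the element count (the .msh path here is a hypothetical placeholder, not a file name the make_mesh.sh scripts are known to produce; assumes meshmode's gmsh reader):

```python
# Hypothetical sanity check of a generated mesh; the path is an assumption.
from meshmode.mesh.io import read_gmsh

mesh = read_gmsh("data/3D/isolator.msh", force_ambient_dim=3)
print(f"{mesh.nelements} elements, dim {mesh.dim}, order {mesh.groups[0].order}")
```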

@matthiasdiener (Member) commented Feb 1, 2022

I believe this is the error message?

  File "/p/gpfs1/manders/work/CEESD/Drivers/CEESD-Y2-isolator-parallel/emirge/pytato/pytato/distributed.py", line 660, in execute_distributed_partition
    exec_ready_part(partition.parts[pid])
  File "/p/gpfs1/manders/work/CEESD/Drivers/CEESD-Y2-isolator-parallel/emirge/pytato/pytato/distributed.py", line 615, in exec_ready_part
    _evt, result_dict = prg_per_partition[part.pid](queue, **inputs)
  File "/p/gpfs1/manders/work/CEESD/Drivers/CEESD-Y2-isolator-parallel/emirge/pytato/pytato/target/loopy/__init__.py", line 146, in __call__
    return self.program(queue,
  File "/p/gpfs1/manders/work/CEESD/Drivers/CEESD-Y2-isolator-parallel/emirge/loopy/loopy/translation_unit.py", line 342, in __call__
    return pex(*args, **kwargs)
  File "/p/gpfs1/manders/work/CEESD/Drivers/CEESD-Y2-isolator-parallel/emirge/loopy/loopy/target/pyopencl_execution.py", line 367, in __call__
    return translation_unit_info.invoker(
  File "/usr/WS1/xpacc/Users/manders/software/Install/Lassen/Conda/envs/mirgeDriver.Y2isolator-parallel/lib/python3.9/site-packages/pytools/py_codegen.py", line 150, in __call__
    return self.func(*args, **kwargs)
  File "<generated code for 'invoke__pt_kernel_loopy_kernel'>", line 248, in invoke__pt_kernel_loopy_kernel
  File "<generated code for 'invoke__pt_kernel_loopy_kernel'>", line 31, in _lpy_host__pt_kernel
pyopencl._cl.RuntimeError: clEnqueueNDRangeKernel failed: OUT_OF_RESOURCES

cc: @kaushikcfd @inducer @MTCam

@anderson2981 (Contributor, Author)

Yup, that's the one I usually get.

I ran on 16 nodes last night and got this error message instead:
CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed

@matthiasdiener (Member) commented Feb 4, 2022

Unfortunately, the com-bozzle case does not seem to reproduce this issue:

Com-bozzle (mixalot.py), GPU, porter:

|       | 1 rank                | 2 ranks               |
|-------|-----------------------|-----------------------|
| Eager | ❌ (OUT_OF_RESOURCES) | ❌ (OUT_OF_RESOURCES) |
| Lazy  |                       |                       |

@MTCam (Member) commented Feb 4, 2022

> Unfortunately, the com-bozzle case does not seem to reproduce this issue:

Agreed. For some reason Eager tops out at ~23k elements, but lazy is able to run all the way out to 444k elements without seeing this issue. For a failing lazy case in mixalot.py, set xscale=yscale=zscale=16.

@matthiasdiener (Member)

An interesting test case for mixalot.py is

    x_scale = 8
    y_scale = 8
    z_scale = 16

which succeeds on one rank but fails with OOM on two ranks.
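
(mixalot.py is not part of this repo, so purely as a hypothetical illustration of how such scale factors typically grow a box mesh, assuming they multiply a baseline per-axis element count:)

```python
# Hypothetical illustration only; mixalot.py's actual meshing code may differ.
from meshmode.mesh.generation import generate_regular_rect_mesh

base_n = 2                      # assumed baseline elements per axis
x_scale, y_scale, z_scale = 8, 8, 16

mesh = generate_regular_rect_mesh(
    a=(0.0, 0.0, 0.0), b=(1.0, 1.0, 1.0),
    nelements_per_axis=(base_n * x_scale, base_n * y_scale, base_n * z_scale))
print(mesh.nelements)  # element count grows with the product of the scales
```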

@matthiasdiener (Member)

inducer/pytato#258 should help with memory consumption

@anderson2981 (Contributor, Author) commented Mar 9, 2022

This seems to be resolved for the 3M-element mesh (halfX); it runs on 2 nodes (8 ranks) on Lassen.

I am still having problems with the oneX mesh (~20M elements). When using 8 nodes (32 ranks), I get the following message: CUDA_ERROR_INVALID_IMAGE: device kernel image is invalid. This problem does run with eager for that parallel decomposition.

Edit:
The full error message is:

2022-03-09 09:08:33,561 - INFO - arraycontext.impl.pytato.compile - transform_loopy_program: completed (0.43s wall 1.00x CPU)
[2022-03-09 17:08:33.576405046] POCL: in fn load_or_generate_kernel at line 983:
  *** INFO ***  |      CUDA |  cuModuleLoadDataEx(/tmp/manders/pocl-cache/25472/OE/IDCNKDEAGNJAALJAANLCHPDBHPPGIMFBBEPCK/_pt_kernel/0-0-0/parallel.bc.ptx) log:
[2022-03-09 17:08:33.576438570] POCL: in fn load_or_generate_kernel at line 986:
  *** INFO ***  |      CUDA | Error during cuModuleLoadDataEx
CUDA_ERROR_INVALID_IMAGE: device kernel image is invalid

In test_mesh/lazy/oneX, the run_params.yaml and runLassenBatch.sh files will reproduce the issue on Lassen.

I am currently running with export POCL_DEBUG=cuda and will post updated information when it finishes.

@anderson2981 (Contributor, Author)

Running with the newest fusion array contractor, I get this error when compiling isolator_injection_run.py.


@kaushikcfd (Collaborator)

> Running with the newest fusion array contractor, I get this error when compiling isolator_injection_run.py.

I tried running it with the instructions (smoke_injection_2d) in illinois-ceesd/drivers_y2-isolator#14 but could not reproduce it. Could you add the instructions that would reproduce this?

@anderson2981 (Contributor, Author)

I think I fixed the issue, at least temporarily. I updated the instructions linked above to include descriptions for using the production-scale meshes. Currently, none of these meshes/runs can be compiled with lazy; they fail with the error message:
CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed

@kaushikcfd (Collaborator)

nspecies=7 is still an issue for those branches. I'll take a look at what could be done for those cases.

@matthiasdiener (Member)

The following two branches should address the arg size issue on Nvidia:

In particular, I was able to run drivers_y2-isolator/smoke_test_injection_3d with nspecies=7 successfully on Nvidia GPUs with these branches.
