
parallel distributed lazy for big(er) meshes #599

Open · anderson2981 opened this issue Feb 1, 2022 · 12 comments

@anderson2981 (Contributor)

Parallel distributed lazy runs out of memory during compilation on a 3M-element, 3D, second-order mesh.

Node counts of 1, 2, 4, and 8 were tried on Lassen, with 4 ranks (GPUs) used per node.

This mesh will run on 4 nodes in eager mode.
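
For context, a minimal sketch of what "eager" vs. "lazy" means in this setup, assuming the stock arraycontext classes rather than whatever subclasses the driver actually uses (illustrative only):

```python
# Illustrative only -- not the driver's actual setup code.
import pyopencl as cl
import pyopencl.tools as cl_tools
from arraycontext import PyOpenCLArrayContext, PytatoPyOpenCLArrayContext

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
alloc = cl_tools.MemoryPool(cl_tools.ImmediateAllocator(queue))

# Eager: each array operation runs immediately as its own small kernel.
eager_actx = PyOpenCLArrayContext(queue, allocator=alloc)

# Lazy: operations build a pytato DAG that is compiled per rank into one
# large loopy/OpenCL program; that compiled program is what exhausts
# GPU resources here.
lazy_actx = PytatoPyOpenCLArrayContext(queue, allocator=alloc)
```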

The data sets can be found in the distributed-lazy branch of the isolator driver, in the test_mesh subdirectory. halfX is linked to the 3M-element mesh; eigthX and quarterX are smaller meshes which run successfully. There is also a 21M-element mesh located in the oneX directory.

The meshes are not checked into the repo, but can be generated with the make_mesh.sh scripts located in data/3D/. These scripts require gmsh 4.9.3 to be installed.
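
As a quick sanity check after generating a mesh, something like the following reports the element count (the .msh path here is a hypothetical placeholder, not a file name the make_mesh.sh scripts are known to produce; assumes meshmode's gmsh reader):

```python
# Hypothetical sanity check of a generated mesh; the path is an assumption.
from meshmode.mesh.io import read_gmsh

mesh = read_gmsh("data/3D/isolator.msh", force_ambient_dim=3)
print(f"{mesh.nelements} elements, dim {mesh.dim}, order {mesh.groups[0].order}")
```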

@matthiasdiener (Member) commented Feb 1, 2022

I believe this is the error message?

  File "/p/gpfs1/manders/work/CEESD/Drivers/CEESD-Y2-isolator-parallel/emirge/pytato/pytato/distributed.py", line 660, in execute_distributed_partition
    exec_ready_part(partition.parts[pid])
  File "/p/gpfs1/manders/work/CEESD/Drivers/CEESD-Y2-isolator-parallel/emirge/pytato/pytato/distributed.py", line 615, in exec_ready_part
    _evt, result_dict = prg_per_partition[part.pid](queue, **inputs)
  File "/p/gpfs1/manders/work/CEESD/Drivers/CEESD-Y2-isolator-parallel/emirge/pytato/pytato/target/loopy/__init__.py", line 146, in __call__
    return self.program(queue,
  File "/p/gpfs1/manders/work/CEESD/Drivers/CEESD-Y2-isolator-parallel/emirge/loopy/loopy/translation_unit.py", line 342, in __call__
    return pex(*args, **kwargs)
  File "/p/gpfs1/manders/work/CEESD/Drivers/CEESD-Y2-isolator-parallel/emirge/loopy/loopy/target/pyopencl_execution.py", line 367, in __call__
    return translation_unit_info.invoker(
  File "/usr/WS1/xpacc/Users/manders/software/Install/Lassen/Conda/envs/mirgeDriver.Y2isolator-parallel/lib/python3.9/site-packages/pytools/py_codegen.py", line 150, in __call__
    return self.func(*args, **kwargs)
  File "<generated code for 'invoke__pt_kernel_loopy_kernel'>", line 248, in invoke__pt_kernel_loopy_kernel
  File "<generated code for 'invoke__pt_kernel_loopy_kernel'>", line 31, in _lpy_host__pt_kernel
pyopencl._cl.RuntimeError: clEnqueueNDRangeKernel failed: OUT_OF_RESOURCES

cc: @kaushikcfd @inducer @MTCam

@anderson2981 (Contributor, Author)

Yup, that's the one I usually get.

I ran on 16 nodes last night and got this error message instead:
CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed

@matthiasdiener (Member) commented Feb 4, 2022

Unfortunately, the com-bozzle case does not seem to reproduce this issue:

Com-bozzle (mixalot.py), GPU, porter:

|       | 1 rank                | 2 ranks               |
|-------|-----------------------|-----------------------|
| Eager | ❌ (OUT_OF_RESOURCES) | ❌ (OUT_OF_RESOURCES) |
| Lazy  |                       |                       |

@MTCam (Member) commented Feb 4, 2022

> Unfortunately, the com-bozzle case does not seem to reproduce this issue:

Agreed. For some reason Eager tops out at ~23k elements, but lazy is able to run all the way out to 444k elements without seeing this issue. For a failing lazy case in mixalot.py, set xscale=yscale=zscale=16.

@matthiasdiener (Member)

An interesting test case for mixalot.py is

    x_scale = 8
    y_scale = 8
    z_scale = 16

which succeeds on one rank but fails with OOM on two ranks.
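
(mixalot.py is not part of this repo, so purely as a hypothetical illustration of how such scale factors typically grow a box mesh, assuming they multiply a baseline per-axis element count:)

```python
# Hypothetical illustration only; mixalot.py's actual meshing code may differ.
from meshmode.mesh.generation import generate_regular_rect_mesh

base_n = 2                      # assumed baseline elements per axis
x_scale, y_scale, z_scale = 8, 8, 16

mesh = generate_regular_rect_mesh(
    a=(0.0, 0.0, 0.0), b=(1.0, 1.0, 1.0),
    nelements_per_axis=(base_n * x_scale, base_n * y_scale, base_n * z_scale))
print(mesh.nelements)  # element count grows with the product of the scales
```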

@matthiasdiener (Member)

inducer/pytato#258 should help with memory consumption

@anderson2981 (Contributor, Author) commented Mar 9, 2022

This seems to be resolved for the 3M-element mesh (halfX); it runs on 2 nodes (8 ranks) on Lassen.

I am still having problems with the oneX mesh (~20M elements). When using 8 nodes (32 ranks), I get the following message: CUDA_ERROR_INVALID_IMAGE: device kernel image is invalid. This problem does run with eager for that parallel decomposition.

Edit:
The full error message is:

2022-03-09 09:08:33,561 - INFO - arraycontext.impl.pytato.compile - transform_loopy_program: completed (0.43s wall 1.00x CPU)
[2022-03-09 17:08:33.576405046] POCL: in fn load_or_generate_kernel at line 983:
  *** INFO ***  |      CUDA |  cuModuleLoadDataEx(/tmp/manders/pocl-cache/25472/OE/IDCNKDEAGNJAALJAANLCHPDBHPPGIMFBBEPCK/_pt_kernel/0-0-0/parallel.bc.ptx) log:
[2022-03-09 17:08:33.576438570] POCL: in fn load_or_generate_kernel at line 986:
  *** INFO ***  |      CUDA | Error during cuModuleLoadDataEx
CUDA_ERROR_INVALID_IMAGE: device kernel image is invalid

In test_mesh/lazy/oneX, the run_params.yaml and runLassenBatch.sh files will reproduce the issue on Lassen.

I am currently running with export POCL_DEBUG=cuda and will post updated information when it finishes.

@anderson2981 (Contributor, Author)

Running with the newest fusion array contractor, I get this error when compiling isolator_injection_run.py.


@kaushikcfd (Collaborator)

> Running with the newest fusion array contractor, I get this error when compiling isolator_injection_run.py.

I tried running it with the instructions (smoke_injection_2d) in illinois-ceesd/drivers_y2-isolator#14 but could not reproduce it. Could you add the instructions that would reproduce this?

@anderson2981 (Contributor, Author)

I think I fixed the issue, at least temporarily. I updated the instructions linked above to include descriptions for using the production-scale meshes. Currently, none of these meshes/runs can be compiled with lazy; they fail with the error message:
CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed

@kaushikcfd (Collaborator)

nspecies=7 is still an issue for those branches. I'll take a look at what could be done for those cases.

@matthiasdiener (Member)

The following two branches should address the arg size issue on Nvidia:

In particular, I was able to run drivers_y2-isolator/smoke_test_injection_3d with nspecies=7 successfully on Nvidia GPUs with these branches.
