parallel distributed lazy for big(er) meshes #599
I believe this is the error message?
Yup, that's the one I usually get. I ran on 16 nodes last night and got this error message instead.
Unfortunately, the com-bozzle case does not seem to reproduce this issue:
Agreed. For some reason eager tops out at ~23k elements, but lazy is able to run all the way out to 444k elements without seeing this issue. For a failing lazy case in
An interesting test case for
which succeeds on one rank but fails with OOM on two ranks.
inducer/pytato#258 should help with memory consumption.
This seems to be resolved for the 3M element mesh (halfX); it runs on 2 nodes (8 ranks) on lassen. I am still having problems with the oneX mesh (~20M elements). I get the following message:
Edit: test_mesh/lazy/oneX contains run_params.yaml and runLassenBatch.sh, which will reproduce the issue on lassen.
Running with the newest fusion array contractor, I get this error when compiling isolator_injection_run.py.
I tried running it with the instructions (smoke_injection_2d) in illinois-ceesd/drivers_y2-isolator#14 and could not reproduce it. Could you add the instructions that would reproduce this?
I think I fixed the issue, at least temporarily. I updated the instructions linked above to include descriptions for using the production-scale meshes. Currently none of these meshes/runs can be compiled with lazy. They fail with the error message
The following two branches should address the arg size issue on Nvidia:
In particular, I was able to run
Parallel distributed lazy runs out of memory during compilation on a 3D, second-order mesh with 3M elements.
Node counts of 1, 2, 4, and 8 were tried on lassen, with 4 ranks (GPUs) used on each node.
This mesh will run on 4 nodes in eager mode.
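For a rough sense of the per-rank load in the configurations above, here is a minimal sketch that divides the mesh evenly across ranks. The even split is an assumption (a real partitioner will be uneven, and halo data is ignored), the 3M figure is the approximate count quoted in this report, and `elements_per_rank` is a hypothetical helper, not part of any driver.

```python
# Even-partition estimate only; real partitions and halo layers will differ.
TOTAL_ELEMENTS = 3_000_000   # "3M element mesh" from the report (approximate)
RANKS_PER_NODE = 4           # 4 ranks (GPUs) per node on lassen, per the report
NODE_COUNTS = (1, 2, 4, 8)   # node counts that were tried


def elements_per_rank(total_elements: int, nodes: int, ranks_per_node: int) -> int:
    """Ceiling of total_elements / (nodes * ranks_per_node)."""
    ranks = nodes * ranks_per_node
    return -(-total_elements // ranks)  # integer ceiling division


for nodes in NODE_COUNTS:
    ranks = nodes * RANKS_PER_NODE
    per_rank = elements_per_rank(TOTAL_ELEMENTS, nodes, RANKS_PER_NODE)
    print(f"{nodes} node(s), {ranks:2d} ranks: ~{per_rank:,} elements per rank")
```

Under this estimate the four configurations range from about 750,000 elements per rank (1 node) down to about 93,750 (8 nodes).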
The data sets can be found in the test_mesh subdirectory of the distributed lazy branch of the isolator driver (isolator). halfX is linked to the 3M element mesh; eigthX and quarterX are smaller meshes that run successfully. There is also a 21M element mesh located in the oneX directory.
The meshes are not checked into the repo, but can be generated with the make_mesh.sh scripts located in data/3D/. These scripts require gmsh 4.9.3 to be installed.
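Before running the mesh scripts, it may help to confirm which gmsh is on the PATH. The sketch below is not part of the repo; it assumes the executable is named `gmsh` and accepts `--version`, and it treats 4.9.3 as an exact requirement, which is one reading of the note above. `parse_version` and `installed_gmsh_version` are hypothetical helper names.

```python
import re
import shutil
import subprocess
from typing import Optional

REQUIRED_GMSH = "4.9.3"  # version named in the report


def parse_version(text: str) -> Optional[str]:
    """Return the first X.Y.Z version string found in text, or None."""
    match = re.search(r"\d+\.\d+\.\d+", text)
    return match.group(0) if match else None


def installed_gmsh_version() -> Optional[str]:
    """Ask the gmsh executable on PATH for its version; None if unavailable."""
    if shutil.which("gmsh") is None:
        return None
    result = subprocess.run(["gmsh", "--version"], capture_output=True, text=True)
    # Assumption: the version may be printed to either stream, so check both.
    return parse_version(result.stdout + result.stderr)


if __name__ == "__main__":
    found = installed_gmsh_version()
    if found == REQUIRED_GMSH:
        print(f"gmsh {found} found")
    else:
        print(f"expected gmsh {REQUIRED_GMSH}, found {found}")
```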