Perlmutter scheduling #775

Merged

burlen
Collaborator

@burlen burlen commented Aug 28, 2023

Profiling work to determine the best way to use a single Perlmutter GPU node given the interplay between MPI, threads, CUDA, and NetCDF/HDF5.

resolves #772
resolves #769

@burlen burlen changed the base branch from develop to temporal_reduction_multiple_steps_per_request August 28, 2023 18:48
@burlen burlen force-pushed the perlmutter_scheduling branch from 55c34f5 to c80b3f1 Compare August 28, 2023 19:09
@burlen burlen force-pushed the perlmutter_scheduling branch from c80b3f1 to aafdcc2 Compare August 29, 2023 21:30
@burlen
Collaborator Author

burlen commented Aug 29, 2023

this figure was made before the streaming bug was fixed!

perlmutter_cpu_threading

@burlen
Collaborator Author

burlen commented Aug 29, 2023

this figure was made before the streaming bug was fixed!

perlmutter_1_node_cpu_mpi

@burlen burlen force-pushed the perlmutter_scheduling branch from aafdcc2 to 4e90eb2 Compare August 30, 2023 16:21
burlen added 2 commits August 30, 2023 09:52
add an algorithm property that allows threads in the thread pool to
inherit their device assignment from downstream. This reduces
inter-device data movement when chaining thread pools.
@burlen burlen force-pushed the perlmutter_scheduling branch from 4e90eb2 to 812a601 Compare August 30, 2023 16:53
@burlen burlen changed the title WIP -- Perlmutter scheduling Perlmutter scheduling Aug 30, 2023
@burlen burlen force-pushed the perlmutter_scheduling branch from bf05a56 to ebea8ff Compare August 30, 2023 19:42
@burlen
Collaborator Author

burlen commented Aug 30, 2023

perlmutter_cpu_threading
1 node, 1 MPI rank, varying the number of threads. Takeaway: 2 writer threads and 4 reduce threads performed best.

@burlen
Collaborator Author

burlen commented Aug 30, 2023

perlmutter_cpu_rstream
For 1 rank with the best threading configuration, vary the reduce stream size from 2 to N. Takeaway: a stream size of 8 was slightly better. Similar tests on the writer showed that stream size made no difference.

@burlen
Collaborator Author

burlen commented Aug 30, 2023

perlmutter_1_node_cpu_mpi
1 node. Using the best threading and stream size from above, vary the number of MPI ranks. Takeaway: 16 MPI ranks per node was best. This is a GPU partition node, with a single CPU socket and 4 NICs. The CPU partition node has 2 CPU sockets and 1 NIC, so results may differ there.

burlen added 7 commits August 31, 2023 09:22
was forwarding to teca_algorithm which resulted in none of the threading
related properties being picked up from the command line.
This fixes a bug introduced in 7120ecb. There the early termination
criterion was dropped from the loop that scans for completed work. Early
termination is the basis for streaming, and without it we were waiting
for all work to complete before returning, effectively disabling
streaming.
when the requested number of threads is less than -1, use at most
this many threads. Fewer may be used if there are insufficient cores
on the node.
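
The early-termination fix described in the commit message above can be illustrated with a minimal Python sketch. This is not TECA's code: `wait_some`, `n_to_wait`, and `poll_interval` are hypothetical names, and Python futures stand in for the thread pool's work items. The point is only that the scan returns as soon as enough work has completed instead of waiting for everything.

```python
import time
from concurrent.futures import Future
from typing import List


def wait_some(pending: List[Future], n_to_wait: int = 1,
              poll_interval: float = 0.001) -> List[Future]:
    """Scan `pending` for completed work and return it.

    Hypothetical sketch: completed futures are removed from `pending`
    (in place) and returned. The scan stops early once at least
    `n_to_wait` items are complete, which is what makes streaming
    possible. Without that check the loop would only return after
    all work had finished.
    """
    done: List[Future] = []
    while True:
        still_running = []
        for f in pending:
            (done if f.done() else still_running).append(f)
        pending[:] = still_running
        # Early termination: return what is ready instead of waiting
        # for every outstanding item.
        if len(done) >= n_to_wait or not pending:
            return done
        time.sleep(poll_interval)
```

A caller would invoke `wait_some` repeatedly, processing each returned batch while the remaining work continues in the background.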
@burlen burlen force-pushed the perlmutter_scheduling branch from ebea8ff to 1d5f415 Compare August 31, 2023 16:33
@burlen
Collaborator Author

burlen commented Aug 31, 2023

perlmutter_cpu_gpu_threading
Right: 1 node, 1 GPU, varying threads. Left: CPU only. Takeaway: 2 writer threads and 4 reduce threads are best, the same as CPU only.

@burlen
Collaborator Author

burlen commented Aug 31, 2023

perlmutter_1_node_gpu_mpi_nomps
Comparing runs with and without NVIDIA MPS (CUDA Multi-Process Service):
https://docs.nersc.gov/systems/perlmutter/running-jobs/#oversubscribing-gpus-with-cuda-multi-process-service
Takeaway: MPS only helps above 2 ranks per device. With 2 or fewer ranks per device it didn't help.

@burlen
Collaborator Author

burlen commented Aug 31, 2023

perlmutter_1_node_gpu_cpu_mpi
Comparing CPU to GPU on a single node. In blue, all ranks used a GPU. In cyan, GPU usage was limited to 2 ranks per GPU; above 8 ranks, CPUs were also used. In red, CPU only. All runs used 2 writer threads, 4 reduce threads, and a stream size of 8.
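
A rough sketch of the "2 ranks per GPU, then fall back to CPU" layout follows. This is an illustration, not the mapping used for these runs: it assumes 4 GPUs per node (consistent with CPUs being used above 8 ranks), assumes a round-robin assignment of ranks to devices, and uses a hypothetical `CPU_ONLY` sentinel and function name.

```python
CPU_ONLY = -1  # hypothetical sentinel meaning "run this rank on the CPU"


def assign_device(rank_on_node: int, n_devices: int = 4,
                  ranks_per_device: int = 2) -> int:
    """Return a device id for a rank, or CPU_ONLY once the GPUs are full.

    Assumption: ranks are dealt round-robin over the devices until each
    device hosts `ranks_per_device` ranks. All later ranks use the CPU.
    """
    if rank_on_node < n_devices * ranks_per_device:
        return rank_on_node % n_devices
    return CPU_ONLY


# Ranks 0-7 share the 4 GPUs two at a time; ranks 8 and up use the CPU.
layout = [assign_device(r) for r in range(10)]
```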

Above 16 ranks, the number of threads is reduced automatically to avoid oversubscription: at 32 ranks, 2 writer threads and 2 reduce threads; at 64 ranks, 1 writer thread and 1 reduce thread.
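
One hypothetical policy that reproduces these thread counts is sketched below. It is an assumption, not the actual scheduling code: it assumes 64 physical cores per node and assumes each thread pool is clamped independently to the cores available per rank. `clamp_threads` and its parameters are illustrative names.

```python
def clamp_threads(at_most: int, cores_per_node: int,
                  ranks_per_node: int) -> int:
    """Use at most `at_most` threads, fewer if the rank's share of
    cores is smaller. Always returns at least 1.

    Assumption: cores are divided evenly among the ranks on the node.
    """
    cores_per_rank = max(1, cores_per_node // ranks_per_node)
    return max(1, min(at_most, cores_per_rank))


# Requested upper bounds from the runs above: 2 writer, 4 reduce threads.
# 64 cores per node is an assumption about the node.
table = {
    ranks: (clamp_threads(2, 64, ranks), clamp_threads(4, 64, ranks))
    for ranks in (16, 32, 64)
}
```

Under these assumptions, 16 ranks keep the full 2 writer and 4 reduce threads, 32 ranks drop to 2 and 2, and 64 ranks drop to 1 and 1.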

@burlen burlen merged commit 90500dc into temporal_reduction_multiple_steps_per_request Aug 31, 2023
@burlen burlen deleted the perlmutter_scheduling branch August 31, 2023 22:48
@burlen
Collaborator Author

burlen commented Aug 31, 2023

@amandasd merged to your branch. some critical fixes here
