Lower all-to-all communication volume in fft transposes #110

Open
JHopeCollins opened this issue Mar 28, 2023 · 3 comments · May be fixed by #123
Assignees
Labels
Core functionality Adding to the main paradiag functionality performance Improving or fixing the computational performance

Comments

@JHopeCollins
Member

Currently we transpose a complex array before/after doing the fft/ifft in the preconditioner.
The forward transpose before the fft and the backward transpose after the ifft could be on real arrays, which would halve the communication volume. This would need a second pencil/transfer to be created for the real transposes.
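
A minimal NumPy sketch of the volume argument, assuming the data entering the forward transpose is purely real. The array sizes and names (`nt`, `nx`, `residual`) are made up for illustration, and the transpose here is a local `.T` standing in for the distributed one:

```python
import numpy as np

rng = np.random.default_rng(0)
nt, nx = 8, 6                              # hypothetical sizes, for illustration only
residual = rng.standard_normal((nt, nx))   # real (float64) data before the forward transpose

complex_buf = residual.astype(np.complex128)   # what a complex transpose has to move
real_buf = residual                            # what a real transpose would move

# A complex128 entry is two float64 values, so the real buffer is half the size.
volume_ratio = complex_buf.nbytes / real_buf.nbytes

# Casting to complex after the transpose gives the same FFT result as
# casting before it, so nothing is lost by transposing the real array.
fft_cast_before = np.fft.fft(complex_buf.T, axis=1)
fft_cast_after = np.fft.fft(real_buf.T.astype(np.complex128), axis=1)

print(volume_ratio, np.allclose(fft_cast_before, fft_cast_after))
```

The same reasoning would apply in reverse to the backward transpose after the ifft, provided only the real part of the result is needed afterwards.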

@JHopeCollins JHopeCollins added the Core functionality Adding to the main paradiag functionality label Mar 28, 2023
@JHopeCollins JHopeCollins changed the title Lower all-to-all communication volume Lower all-to-all communication volume in fft transposes Mar 28, 2023
@JHopeCollins JHopeCollins added the performance Improving or fixing the computational performance label Jun 22, 2023
@JHopeCollins JHopeCollins self-assigned this Jun 26, 2023
@JHopeCollins JHopeCollins linked a pull request Jun 27, 2023 that will close this issue
@JHopeCollins
Member Author

The mpi4py-fft module uses Alltoallw for the transpose, which can take different lengths and types of data on each rank. As far as I know this is rarely optimised by vendors because it's so general, so it is usually just implemented as one big round of isend/irecv.
If we can change this to Alltoallv then it might improve the performance, because Alltoallv is more likely to be optimised for proper n log n collective performance.
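
A rough sketch of the extra bookkeeping Alltoallv would need, assuming a simple 2D slab decomposition (rows distributed before the transpose, columns after). Alltoallw can describe strided blocks with derived datatypes, whereas Alltoallv needs the data for each destination packed contiguously, with counts and displacements. All function names here are hypothetical, and the exchange is simulated in a single process with NumPy rather than run through MPI:

```python
import numpy as np

def block_sizes(n, nranks):
    """Split n items over nranks as evenly as possible."""
    base, extra = divmod(n, nranks)
    return np.array([base + (r < extra) for r in range(nranks)])

def offsets(counts):
    """Exclusive prefix sum: the start index of each block."""
    return np.concatenate(([0], np.cumsum(counts)[:-1]))

def pack_for_alltoallv(local, col_sizes):
    """Pack a row slab so the data bound for each rank is contiguous."""
    col_starts = offsets(col_sizes)
    chunks = [local[:, s:s + n].ravel() for s, n in zip(col_starts, col_sizes)]
    sendbuf = np.concatenate(chunks)
    sendcounts = local.shape[0] * col_sizes
    # With mpi4py these would be passed roughly as
    #   comm.Alltoallv([sendbuf, sendcounts, sdispls, MPI.DOUBLE],
    #                  [recvbuf, recvcounts, rdispls, MPI.DOUBLE])
    return sendbuf, sendcounts, offsets(sendcounts)

def simulate_transpose(global_array, nranks):
    """Emulate the exchange on one process: row slabs in, column slabs out."""
    n0, _ = global_array.shape
    row_sizes = block_sizes(n0, nranks)
    col_sizes = block_sizes(global_array.shape[1], nranks)
    row_starts = offsets(row_sizes)
    packed = [pack_for_alltoallv(global_array[s:s + n], col_sizes)
              for s, n in zip(row_starts, row_sizes)]
    result = []
    for q in range(nranks):          # receiving rank
        out = np.empty((n0, col_sizes[q]), dtype=global_array.dtype)
        for r in range(nranks):      # sending rank
            sendbuf, counts, displs = packed[r]
            chunk = sendbuf[displs[q]:displs[q] + counts[q]]
            out[row_starts[r]:row_starts[r] + row_sizes[r]] = \
                chunk.reshape(row_sizes[r], col_sizes[q])
        result.append(out)
    return result

global_array = np.arange(7 * 5, dtype=float).reshape(7, 5)
column_slabs = simulate_transpose(global_array, nranks=3)
```

The pack and unpack copies are the price of switching collectives, so whether this wins overall is exactly what the benchmarks would need to show.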

@JHopeCollins
Member Author

JHopeCollins commented Jul 24, 2023

  • Benchmark current implementation (all complex transposes and alltoallw)
  • Benchmark reduced transpose volume (with alltoallw)
    • real-complex transposes
    • reduced precision
  • Benchmark alltoallv (all complex transposes and alltoallv)
  • Benchmark both modifications (lower communication volume and alltoallv)

@JHopeCollins
Member Author

We could also try communicating in lower precision. All of the computation is done with PETSc so we are limited to double precision there (or whatever PETSc has been compiled with), but the transposes are just mpi4py calls with numpy arrays. We could do these (and the fft/ifft) with a lower precision to reduce communication volume again.
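
A small sketch of what the reduced-precision transpose costs and saves, assuming the data is down-cast to single precision just before the communication and up-cast again afterwards. The array shape is arbitrary, and the communication itself is left out. Whether the fft/ifft could also run in single precision depends on the FFT library, so that part is not shown:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((16, 32))   # float64, e.g. copied out of a double-precision PETSc Vec

send = data.astype(np.float32)         # down-cast just before the transpose
recv = send.astype(np.float64)         # up-cast again once the data has arrived

# float32 is half the width of float64, so the message size is halved.
volume_ratio = data.nbytes / send.nbytes

# Round-off introduced by the round trip through single precision.
rel_err = np.linalg.norm(recv - data) / np.linalg.norm(data)

print(volume_ratio, rel_err)
```

The relative error is on the order of single-precision round-off, so this would only be acceptable if the preconditioner tolerates that level of inexactness.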
