Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Profile SPEC performance #55

Open
zhisong opened this issue Dec 7, 2018 · 5 comments
Open

Profile SPEC performance #55

zhisong opened this issue Dec 7, 2018 · 5 comments
Assignees

Comments

@zhisong
Copy link
Collaborator

zhisong commented Dec 7, 2018

SPEC profiling

I ran ARM forge, a tracing/profiling tool for parallel computing.
I ran it on both xspec and dspec. The test case is a LHD high beta case with 5 volumes, resolutions are M=10, N=8, Lrad=6.
The test is based on the master branch 007be72, newest as date 6/12/2018.
The input and output file, the overall results are attached as .txt files and .pdf files to this thread.

A break down of time calling each subroutines. Green is computation, blue is MPI (mostly waiting for other cpus to finish). The most time consuming subroutine is ma00aa (as expected), contributing 77%.
image

Going into ma00aa, the most time consuming ones are radial integral ones.
image

And on each clause, this is the break down.
image

It seems a lot of time is spent on accessing memory.

xspec - Performance Report.pdf
dspec - Performance Report.pdf
finite_beta_V5_vmec1.sp.txt
ddt_output.txt

@jloizu
Copy link
Collaborator

jloizu commented Dec 7, 2018

That's really great! That sets the first milestone towards SPEC speed optimization. Looks like what Stuart suggested, namely optimizing the volume integrals by e.g. exploiting symmetries, is the way to go.

@lazersos
Copy link
Collaborator

lazersos commented Dec 7, 2018

@zhisong This is great work and underlies my suspicions about the code from day one, specifically the rank 4 arrays are inhibiting vectorization. @jloizu while reducing the total number of floating point operations will help please note that in this analysis the issue is memory access. I think one key to speeding up the code is to vectorize the rank 4 arrays down to rank 2 arrays.

@abaillod
Copy link
Collaborator

Hello there,

I have been looking to SPEC performance as well in the framework of parallelization course at EPFL. I used Intel profiling tool (Vtune Amplifier) to check which routines were the most time consuming and I got the same results as you - about 50% of the time is spent is ma00aa.h (The test case is an axisymmetric torus, with M=4 and Lrad=16, 32 volumes, Lfindzero=1).

I tried to parallelize the volume integral loops with OpenMP. I was expecting substantial speedup since the loops are independent of each others... However I had to use some !$OMP CRITICAL around each call to the FFTW library to ensure each call was executed by one thread at a time, since apparently FFTW routines are not thread safe. I think it is possible to make them thread safe (see FFTW documentation http://www.fftw.org/fftw3_doc/Thread-safety.html) but I couldn't make it work for now. Any help on this would be appreciated!

When I run Vtune amplifier with the OpenMP version of ma00aa.h (with 4 threads, 1 MPI task), I get
image
So apparently lots of time is lost in

  • The initial fork.
  • Some kind of barriers (I guess these are related to the critical constructs around the FFTW library calls)

I attach at the end of this post a plot of a strong and weak scaling I did with the OpenMP version of ma00aa.h. You can see that there is almost no speedup. My guess is that threads are queuing to use the FFTW library.

StrWeakScaling_Hybrid

Please, tell me what you think about this and if you think it is worth it!

@zhisong
Copy link
Collaborator Author

zhisong commented Jun 11, 2019

@abaillod Thank you for your impressive progress.

After some search, it seems to me that !$OMP CRITICAL makes the threads queuing to use the FFTW library.
The website you quoted does provide a solution to multi-thread programming with FFTW. Since the FFTW plan is the same for all threads, you may need to follow the suggestion in the link New arrays execute functions.

Other than OpenMP, I have another suggestion to speed up the computing by a factor of Nquad (the number of Gauss quadrature points), but we may need to completely rewrite the structure of metrix, coords, ma00aa and so on. We can discuss this "big" project when Stuart visits ANU.

@zhisong
Copy link
Collaborator Author

zhisong commented Jul 10, 2019

@abaillod I found another possibility that slowed down the calculation, that is, the LAPACK subroutines that inverts the matrices. If you are using Intel MKL libraries, you should be able to use OpenMP version of LAPACK by setting MKL_NUM_THREADS=something>1 to speed up.

For your problem concerning FFTW, you can put !$OMP PARALLEL commend after WCALL( ma00aa, metrix,( lvol, lss ) ), so that you are effectively parallelizing with respect to mn instead of radial coordinate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants