Profile SPEC performance #55
That's really great! That sets the first milestone towards SPEC speed optimization. It looks like what Stuart suggested, namely optimizing the volume integrals by e.g. exploiting symmetries, is the way to go.
@zhisong This is great work and confirms my suspicions about the code from day one, specifically that the rank-4 arrays are inhibiting vectorization. @jloizu While reducing the total number of floating-point operations will help, please note that in this analysis the issue is memory access. I think one key to speeding up the code is to flatten the rank-4 arrays down to rank-2 arrays.
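A minimal sketch of the kind of flattening meant here, with hypothetical array names and sizes (not SPEC's actual data layout): collapsing the two radial indices into one row index and the two mode indices into one column index gives long, unit-stride sweeps that compilers vectorize readily.

```fortran
program flatten_sketch
  implicit none
  ! Hypothetical sizes, for illustration only.
  integer, parameter :: nr = 17, nmn = 81
  real(8), target  :: a4(nr, nr, nmn, nmn)   ! rank-4, as in the current code
  real(8), pointer :: a2(:,:)                ! rank-2 view of the same memory
  integer :: k

  call random_number(a4)

  ! Fortran 2008 rank-remapping pointer: the two radial indices become
  ! one contiguous row index, the two mode indices one column index.
  a2(1:nr*nr, 1:nmn*nmn) => a4

  ! Loops over a2 are simple unit-stride sweeps; the rank-4 indexing
  ! arithmetic is gone from the hot loop.
  do k = 1, nmn*nmn
     a2(:,k) = 2.0d0 * a2(:,k)   ! stand-in for the real work
  end do
end program flatten_sketch
```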
Hello there, I have been looking into SPEC performance as well, in the framework of a parallelization course at EPFL. I used the Intel profiling tool (VTune Amplifier) to check which routines were the most time consuming, and I got the same results as you: about 50% of the time is spent in ma00aa.h (the test case is an axisymmetric torus, with M=4 and Lrad=16, 32 volumes, Lfindzero=1). I tried to parallelize the volume integral loops with OpenMP. I was expecting substantial speedup since the loops are independent of each other... However, I had to put !$OMP CRITICAL around each call to the FFTW library to ensure each call was executed by one thread at a time, since apparently the FFTW routines are not thread safe. I think it is possible to make them thread safe (see the FFTW documentation, http://www.fftw.org/fftw3_doc/Thread-safety.html), but I couldn't make it work for now. Any help on this would be appreciated! When I run VTune Amplifier with the OpenMP version of ma00aa.h (with 4 threads, 1 MPI task), I get the results shown in the attached screenshot.
I attach at the end of this post plots of strong and weak scaling that I did with the OpenMP version of ma00aa.h. You can see that there is almost no speedup. My guess is that the threads are queuing to use the FFTW library. Please tell me what you think about this and whether you think it is worth pursuing!
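For what it's worth, the pattern the FFTW manual documents for this situation is to keep all planner calls serial and parallelize only the execution: the "new-array" execute functions such as dfftw_execute_dft are thread safe as long as each thread passes its own arrays. A minimal sketch, with hypothetical sizes and array names rather than the actual ma00aa loop:

```fortran
program fftw_omp_sketch
  use omp_lib
  implicit none
  include 'fftw3.f'
  integer, parameter :: n = 64, nvol = 32     ! hypothetical sizes
  complex(8) :: tmpl_in(n), tmpl_out(n)       ! arrays used only for planning
  complex(8), allocatable :: win(:,:), wout(:,:)
  integer(8) :: plan
  integer :: v, t, nthr

  nthr = omp_get_max_threads()
  allocate(win(n, 0:nthr-1), wout(n, 0:nthr-1))

  ! The FFTW planner is NOT thread safe: create the plan serially
  ! (or guard any planner call with !$OMP CRITICAL).
  call dfftw_plan_dft_1d(plan, n, tmpl_in, tmpl_out, FFTW_FORWARD, FFTW_ESTIMATE)

  ! dfftw_execute_dft may be called from several threads at once,
  ! provided each thread uses its own input/output arrays.
!$OMP PARALLEL DO PRIVATE(v, t)
  do v = 1, nvol
     t = omp_get_thread_num()
     win(:,t) = (1.0d0, 0.0d0)                ! stand-in for the volume data
     call dfftw_execute_dft(plan, win(:,t), wout(:,t))
     ! ... consume wout(:,t) for volume v ...
  end do
!$OMP END PARALLEL DO

  call dfftw_destroy_plan(plan)
end program fftw_omp_sketch
```

If this works, the !$OMP CRITICAL sections around the transforms could be dropped entirely, leaving only the planning serialized, which should remove the queuing seen in the scaling plots.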
@abaillod Thank you for your impressive progress. After some search, it seems to me that … Other than OpenMP, I have another suggestion to speed up the computation by a factor of …
@abaillod I found another possibility that slows down the calculation, namely the LAPACK subroutines that invert the matrices. If you are using the Intel MKL libraries, you should be able to use the OpenMP version of LAPACK by setting … For your problem concerning FFTW, you can put …
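For reference, a hedged sketch of what the threaded MKL LAPACK looks like in practice (hypothetical matrix size; assumes the code is linked against MKL's parallel layer, e.g. with -qmkl=parallel): the LAPACK calls themselves are unchanged, only the link line and the thread-count control differ.

```fortran
! Sketch only: assumes the threaded MKL layer is linked;
! the LAPACK calls are identical to the sequential case.
program mkl_lapack_sketch
  implicit none
  integer, parameter :: n = 1024, lwork = 64*n
  real(8) :: A(n,n), work(lwork)
  integer :: ipiv(n), info

  call random_number(A)
  call mkl_set_num_threads(4)   ! or: export MKL_NUM_THREADS=4

  call dgetrf(n, n, A, n, ipiv, info)                           ! threaded LU
  if (info == 0) call dgetri(n, A, n, ipiv, work, lwork, info)  ! inverse from LU
end program mkl_lapack_sketch
```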
SPEC profiling
I ran Arm Forge, a tracing/profiling tool for parallel computing, on both xspec and dspec. The test case is an LHD high-beta case with 5 volumes; the resolutions are M=10, N=8, Lrad=6.
The test is based on master branch commit 007be72, the newest as of 6/12/2018.
The input and output files and the overall results are attached to this thread as .txt and .pdf files.
A breakdown of the time spent calling each subroutine: green is computation, blue is MPI (mostly waiting for other CPUs to finish). The most time-consuming subroutine is ma00aa (as expected), contributing 77%.
Going into ma00aa, the most time-consuming parts are the radial integrals.
And this is the breakdown for each clause.
It seems that a lot of time is spent accessing memory.
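One common source of this with rank-4 arrays is loop ordering: Fortran stores arrays column-major, so the leftmost index must vary fastest in the innermost loop. A minimal sketch with hypothetical names and sizes (not the actual ma00aa loops):

```fortran
program loop_order_sketch
  implicit none
  integer, parameter :: nr = 17, nmn = 81   ! hypothetical sizes
  real(8) :: t4(nr, nr, nmn, nmn)
  integer :: p, q, ii, jj

  call random_number(t4)

  ! Cache-friendly nest: Fortran is column-major, so the leftmost
  ! index (p) runs in the innermost loop -- consecutive iterations
  ! touch consecutive memory addresses.
  do jj = 1, nmn
     do ii = 1, nmn
        do q = 1, nr
           do p = 1, nr
              t4(p, q, ii, jj) = 2.0d0 * t4(p, q, ii, jj)
           end do
        end do
     end do
  end do
  ! Reversing the nest (p outermost, jj innermost) makes each access
  ! jump nr*nr*nmn elements, defeating the cache and the prefetcher.
end program loop_order_sketch
```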
xspec - Performance Report.pdf
dspec - Performance Report.pdf
finite_beta_V5_vmec1.sp.txt
ddt_output.txt