Profile SPEC performance #55
That's really great! That sets the first milestone towards SPEC speed optimization. It looks like what Stuart suggested, namely optimizing the volume integrals by e.g. exploiting symmetries, is the way to go.
@zhisong This is great work and confirms my suspicions about the code from day one, specifically that the rank-4 arrays are inhibiting vectorization. @jloizu While reducing the total number of floating-point operations will help, please note that in this analysis the issue is memory access. I think one key to speeding up the code is to flatten the rank-4 arrays down to rank-2 arrays.
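A minimal sketch of the kind of flattening meant here, with hypothetical array names and sizes (not SPEC's actual data layout): collapsing the two radial indices into one row index and the two mode indices into one column index gives long, unit-stride sweeps that compilers vectorize readily.

```fortran
program flatten_sketch
  implicit none
  ! Hypothetical sizes, for illustration only.
  integer, parameter :: nr = 17, nmn = 81
  real(8), target  :: a4(nr, nr, nmn, nmn)   ! rank-4, as in the current code
  real(8), pointer :: a2(:,:)                ! rank-2 view of the same memory
  integer :: k

  call random_number(a4)

  ! Fortran 2008 rank-remapping pointer: the two radial indices become
  ! one contiguous row index, the two mode indices one column index.
  a2(1:nr*nr, 1:nmn*nmn) => a4

  ! Loops over a2 are simple unit-stride sweeps; the rank-4 indexing
  ! arithmetic is gone from the hot loop.
  do k = 1, nmn*nmn
     a2(:,k) = 2.0d0 * a2(:,k)   ! stand-in for the real work
  end do
end program flatten_sketch
```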
Hello there, I have been looking into SPEC performance as well, in the framework of a parallelization course at EPFL. I used the Intel profiling tool (VTune Amplifier) to check which routines were the most time consuming, and I got the same results as you: about 50% of the time is spent in ma00aa.h (the test case is an axisymmetric torus, with M=4 and Lrad=16, 32 volumes, Lfindzero=1). I tried to parallelize the volume integral loops with OpenMP. I was expecting substantial speedup since the loops are independent of each other... However, I had to put !$OMP CRITICAL around each call to the FFTW library to ensure each call was executed by one thread at a time, since apparently the FFTW routines are not thread safe. I think it is possible to make them thread safe (see the FFTW documentation, http://www.fftw.org/fftw3_doc/Thread-safety.html), but I couldn't make it work for now. Any help on this would be appreciated! When I run VTune Amplifier with the OpenMP version of ma00aa.h (with 4 threads, 1 MPI task), I get the results shown in the attached screenshot.
I attach at the end of this post plots of strong and weak scaling that I did with the OpenMP version of ma00aa.h. You can see that there is almost no speedup. My guess is that the threads are queuing to use the FFTW library. Please tell me what you think about this and whether you think it is worth pursuing!
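For what it's worth, the pattern the FFTW manual documents for this situation is to keep all planner calls serial and parallelize only the execution: the "new-array" execute functions such as dfftw_execute_dft are thread safe as long as each thread passes its own arrays. A minimal sketch, with hypothetical sizes and array names rather than the actual ma00aa loop:

```fortran
program fftw_omp_sketch
  use omp_lib
  implicit none
  include 'fftw3.f'
  integer, parameter :: n = 64, nvol = 32     ! hypothetical sizes
  complex(8) :: tmpl_in(n), tmpl_out(n)       ! arrays used only for planning
  complex(8), allocatable :: win(:,:), wout(:,:)
  integer(8) :: plan
  integer :: v, t, nthr

  nthr = omp_get_max_threads()
  allocate(win(n, 0:nthr-1), wout(n, 0:nthr-1))

  ! The FFTW planner is NOT thread safe: create the plan serially
  ! (or guard any planner call with !$OMP CRITICAL).
  call dfftw_plan_dft_1d(plan, n, tmpl_in, tmpl_out, FFTW_FORWARD, FFTW_ESTIMATE)

  ! dfftw_execute_dft may be called from several threads at once,
  ! provided each thread uses its own input/output arrays.
!$OMP PARALLEL DO PRIVATE(v, t)
  do v = 1, nvol
     t = omp_get_thread_num()
     win(:,t) = (1.0d0, 0.0d0)                ! stand-in for the volume data
     call dfftw_execute_dft(plan, win(:,t), wout(:,t))
     ! ... consume wout(:,t) for volume v ...
  end do
!$OMP END PARALLEL DO

  call dfftw_destroy_plan(plan)
end program fftw_omp_sketch
```

If this works, the !$OMP CRITICAL sections around the transforms could be dropped entirely, leaving only the planning serialized, which should remove the queuing seen in the scaling plots.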
@abaillod Thank you for your impressive progress. After some search, it seems to me that … Other than OpenMP, I have another suggestion to speed up the computation by a factor of …
@abaillod I found another possibility that slows down the calculation, namely the LAPACK subroutines that invert the matrices. If you are using the Intel MKL libraries, you should be able to use the OpenMP version of LAPACK by setting … For your problem concerning FFTW, you can put …
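For reference, a hedged sketch of what the threaded MKL LAPACK looks like in practice (hypothetical matrix size; assumes the code is linked against MKL's parallel layer, e.g. with -qmkl=parallel): the LAPACK calls themselves are unchanged, only the link line and the thread-count control differ.

```fortran
! Sketch only: assumes the threaded MKL layer is linked;
! the LAPACK calls are identical to the sequential case.
program mkl_lapack_sketch
  implicit none
  integer, parameter :: n = 1024, lwork = 64*n
  real(8) :: A(n,n), work(lwork)
  integer :: ipiv(n), info

  call random_number(A)
  call mkl_set_num_threads(4)   ! or: export MKL_NUM_THREADS=4

  call dgetrf(n, n, A, n, ipiv, info)                           ! threaded LU
  if (info == 0) call dgetri(n, A, n, ipiv, work, lwork, info)  ! inverse from LU
end program mkl_lapack_sketch
```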
SPEC profiling
I ran Arm Forge, a tracing/profiling tool for parallel computing, on both xspec and dspec. The test case is an LHD high-beta case with 5 volumes; the resolutions are M=10, N=8, Lrad=6.
The test is based on master branch commit 007be72, the newest as of 6/12/2018.
The input and output files and the overall results are attached to this thread as .txt and .pdf files.
A breakdown of the time spent calling each subroutine: green is computation, blue is MPI (mostly waiting for other CPUs to finish). The most time-consuming subroutine is ma00aa (as expected), contributing 77%.
Going into ma00aa, the most time-consuming parts are the radial integrals.
And this is the breakdown for each clause.
It seems that a lot of time is spent accessing memory.
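One common source of this with rank-4 arrays is loop ordering: Fortran stores arrays column-major, so the leftmost index must vary fastest in the innermost loop. A minimal sketch with hypothetical names and sizes (not the actual ma00aa loops):

```fortran
program loop_order_sketch
  implicit none
  integer, parameter :: nr = 17, nmn = 81   ! hypothetical sizes
  real(8) :: t4(nr, nr, nmn, nmn)
  integer :: p, q, ii, jj

  call random_number(t4)

  ! Cache-friendly nest: Fortran is column-major, so the leftmost
  ! index (p) runs in the innermost loop -- consecutive iterations
  ! touch consecutive memory addresses.
  do jj = 1, nmn
     do ii = 1, nmn
        do q = 1, nr
           do p = 1, nr
              t4(p, q, ii, jj) = 2.0d0 * t4(p, q, ii, jj)
           end do
        end do
     end do
  end do
  ! Reversing the nest (p outermost, jj innermost) makes each access
  ! jump nr*nr*nmn elements, defeating the cache and the prefetcher.
end program loop_order_sketch
```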
xspec - Performance Report.pdf
dspec - Performance Report.pdf
finite_beta_V5_vmec1.sp.txt
ddt_output.txt