This project contains benchmarks related to FFT Corner Turns.
For full functionality, this project requires the following.
A C compiler that supports AVX-512 instructions, e.g., one of:
- GCC >= 5.0.0 - tested with version 9.1.0.
- Clang >= 3.9.1 - tested with version 10.0.1.
- Intel C Compiler >= 15.0.1 - tested with version 19.0.
Build tools:
- CMake >= 2.8.12
- pkg-config
Libraries:
- POSIX Threads - pthreads-compatible library (usually included with compiler).
- FFTW3 - tested with version 3.3.8. Both single and double precision libraries are used (libfftw3f, libfftw3).
- Intel MKL - tested with version 2019 Update 4.
The project uses CMake:
mkdir build
cd build
cmake ..
make
If dependencies are installed to a non-standard location ${PREFIX}, you must
first configure PKG_CONFIG_PATH.
E.g., if FFTW3 is configured with ./configure --prefix=${PREFIX}:
export PKG_CONFIG_PATH=${PREFIX}/lib/pkgconfig/
To use a C compiler other than your environment's cc, configure CMake's
CMAKE_C_COMPILER variable, e.g.:
cmake .. -DCMAKE_C_COMPILER=/path/to/cc
There are a handful of benchmark program templates:
- transp: Populate a matrix and perform a transpose.
- fft-2d: Populate a matrix and perform a 2-D FFT. Whether a transpose is actually performed depends on the FFT implementation.
- fft-ct: Populate a matrix and perform 1-D FFTs -> transpose -> 1-D FFTs. In this benchmark, a transpose is always performed.
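The transp and fft-ct patterns can be illustrated with a minimal sketch. It is not the project's code: the function names (transpose_naive, dft_rows, fft_ct, dft2d_at) are hypothetical, the size is fixed at 4x4, and a naive DFT stands in for a real FFT library so that the twiddle factors are exactly 1, -i, -1, and i. Under those assumptions, row DFTs followed by a transpose and a second pass of row DFTs yield the 2-D DFT, stored transposed.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

#define N 4 /* hypothetical fixed size; keeps the twiddle factors exact */

/* Naive out-of-place transpose: dst (cols x rows) = transpose of src (rows x cols). */
void transpose_naive(const double complex *src, double complex *dst,
                     size_t rows, size_t cols)
{
    assert(src != dst); /* out-of-place only */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}

/* exp(-2*pi*i*k/4) for integer k: exactly 1, -i, -1, or i. */
double complex twiddle4(size_t k)
{
    switch (k % 4) {
    case 0:  return 1.0;
    case 1:  return -I;
    case 2:  return -1.0;
    default: return I;
    }
}

/* Naive 1-D DFT of every row of an N x N matrix, in place. */
void dft_rows(double complex *m)
{
    for (size_t r = 0; r < N; r++) {
        double complex out[N];
        for (size_t k = 0; k < N; k++) {
            out[k] = 0.0;
            for (size_t n = 0; n < N; n++)
                out[k] += m[r * N + n] * twiddle4(k * n);
        }
        for (size_t k = 0; k < N; k++)
            m[r * N + k] = out[k];
    }
}

/* Corner turn: 1-D DFTs on rows -> transpose -> 1-D DFTs on rows.
 * The result is the 2-D DFT of `in`, stored transposed in `out`. */
void fft_ct(const double complex *in, double complex *out)
{
    double complex tmp[N * N];
    for (size_t i = 0; i < N * N; i++)
        tmp[i] = in[i];
    dft_rows(tmp);
    transpose_naive(tmp, out, N, N);
    dft_rows(out);
}

/* Reference: element (a, b) of the 2-D DFT, computed directly. */
double complex dft2d_at(const double complex *in, size_t a, size_t b)
{
    double complex acc = 0.0;
    for (size_t r = 0; r < N; r++)
        for (size_t c = 0; c < N; c++)
            acc += in[r * N + c] * twiddle4(a * r + b * c);
    return acc;
}
```

With this layout, out[b * N + a] from fft_ct corresponds to dft2d_at(in, a, b). A second transpose would restore the conventional ordering; whether the actual benchmarks perform one is not stated here.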
C99 types:
- float (flt)
- double (dbl)
- float complex (fcmplx)
- double complex (dcmplx)
FFTW types:
- fftwf_complex (fftwf) - redefined as float complex
- fftw_complex (fftw) - redefined as double complex
MKL types:
- MKL_Complex8 (cmplx8) - redefined as float complex
- MKL_Complex16 (cmplx16) - redefined as double complex
Benchmark templates are used to generate benchmarks supporting a variety of data
types and transpose implementations using different algorithms and library APIs.
Benchmark names are generally in the form: ${prog}-${datatype}-${algo}[-${lib}].
All benchmarks support the -h parameter to print a usage/help message.
Benchmarks require specifying, at a minimum, the matrix row and column count.
E.g., to perform a naive transpose of a 2048x4096 matrix with double data:
./transp-dbl-naive -r 2048 -c 4096
Some implementations have constraints on parameters:
- Blocked transposes must use block dimensions that are divisors of their corresponding matrix dimensions. I.e., partial blocks are not supported.
- Transposes using AVX-512 instructions require matrix sizes to be multiples of 8x8 blocks. This constraint extends to threaded AVX-512 implementations -- each thread's partition of a matrix must be a multiple of 8x8, e.g., while a single thread (or even three threads) may transpose a 24x24 matrix, two threads cannot because data is partitioned evenly between threads (12x24 or 24x12, for two threads).
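The constraints above can be expressed as simple divisibility checks. The following is only a sketch: the helper names are hypothetical and not part of the project, and the threaded check assumes that rows are the partitioned dimension and that the row count must divide evenly among threads.

```c
#include <stdbool.h>
#include <stddef.h>

/* Blocked transpose: each block dimension must evenly divide the
 * corresponding matrix dimension (no partial blocks). */
bool blocked_dims_ok(size_t rows, size_t cols,
                     size_t blk_rows, size_t blk_cols)
{
    return blk_rows != 0 && blk_cols != 0 &&
           rows % blk_rows == 0 && cols % blk_cols == 0;
}

/* AVX-512 transpose: the matrix must consist of whole 8x8 blocks. */
bool avx512_dims_ok(size_t rows, size_t cols)
{
    return rows % 8 == 0 && cols % 8 == 0;
}

/* Threaded AVX-512 transpose. Assumption: rows are split evenly across
 * threads, and each thread's partition must itself be whole 8x8 blocks. */
bool avx512_threaded_dims_ok(size_t rows, size_t cols, size_t nthreads)
{
    if (nthreads == 0 || rows % nthreads != 0)
        return false;
    return avx512_dims_ok(rows / nthreads, cols);
}
```

Under these assumptions, avx512_threaded_dims_ok(24, 24, 2) returns false because each thread would receive a 12x24 partition, while the one-thread and three-thread cases return true, matching the 24x24 example.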