High-Performance FP32 Matrix Multiplication on CPU

Important note: Don't expect peak performance without tuning hyperparameters such as the number of threads, kernel size, and block sizes, unless you're running on a Ryzen 7 9700X. The current parallelization strategy is optimized for Intel and AMD x86 desktop CPUs. For many-core server processors, consider using nested parallelism and parallelizing 2-3 loops (e.g., the 5th, 3rd, and 2nd loops around the kernel) to increase performance.
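
As a rough illustration of parallelizing more than one loop, the sketch below spreads the tiles of C across threads by collapsing two block loops with OpenMP. It is a simplified stand-in, not the repository's implementation: the block sizes MC, NC, KC, the name matmul_blocked, the row-major layout, and the plain inner loops (in place of packing and a SIMD micro-kernel) are all assumptions made for the example. In BLIS terminology, jc and ic correspond to the 5th and 3rd loops around the kernel; the k loop is moved inside them here so the two can be collapsed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes for illustration; not the repository's tuned values. */
enum { MC = 64, NC = 64, KC = 64 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* C (m x n) += A (m x k) * B (k x n); all matrices row-major and dense. */
void matmul_blocked(const float *A, const float *B, float *C, int m, int n, int k) {
    assert(m >= 0 && n >= 0 && k >= 0);
    /* Each (jc, ic) pair owns a distinct tile of C, so threads never
       write to the same element and no synchronization is needed. */
#pragma omp parallel for collapse(2) schedule(static)
    for (int jc = 0; jc < n; jc += NC) {
        for (int ic = 0; ic < m; ic += MC) {
            int nb = min_int(NC, n - jc);
            int mb = min_int(MC, m - ic);
            for (int pc = 0; pc < k; pc += KC) {
                int kb = min_int(KC, k - pc);
                /* Plain loops standing in for packing and the micro-kernel. */
                for (int i = 0; i < mb; i++) {
                    for (int p = 0; p < kb; p++) {
                        float a = A[(size_t)(ic + i) * k + (pc + p)];
                        for (int j = 0; j < nb; j++) {
                            C[(size_t)(ic + i) * n + (jc + j)] +=
                                a * B[(size_t)(pc + p) * n + (jc + j)];
                        }
                    }
                }
            }
        }
    }
}
```

Build with -fopenmp (GCC or Clang) to enable the pragma; without it the pragma is ignored and the same code runs single-threaded.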

Key Features

Optimized for x86 desktop CPUs with FMA3 and AVX2 instructions
Faster than OpenBLAS and MKL
Step-by-step, beginner-friendly tutorial
Simple and compact implementation
Multithreading via OpenMP
High-level design follows BLIS
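
To make the FMA3/AVX2 and BLIS points above more concrete, here is a minimal sketch of a BLIS-style micro-kernel that updates a small tile of C with 256-bit fused multiply-add intrinsics. The 4x8 tile size, the packed layouts of Ap and Bp, and the name kernel_4x8 are assumptions for illustration; the repository's actual kernel and tile sizes may differ. A scalar fallback is included so the same code still compiles and runs when AVX2/FMA are not enabled.

```c
#include <assert.h>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

enum { MR = 4, NR = 8 }; /* hypothetical tile: 4 rows, 8 floats (one 256-bit register) per row */

/* C tile (MR x NR, row-major, leading dimension ldc) += Ap * Bp.
   Ap is packed as kc groups of MR floats (one column of the A panel each);
   Bp is packed as kc groups of NR floats (one row of the B panel each). */
void kernel_4x8(const float *Ap, const float *Bp, float *C, int kc, int ldc) {
    assert(kc >= 0 && ldc >= NR);
#if defined(__AVX2__) && defined(__FMA__)
    __m256 acc[MR];
    /* Keep the whole C tile in registers while looping over k. */
    for (int i = 0; i < MR; i++) acc[i] = _mm256_loadu_ps(&C[i * ldc]);
    for (int p = 0; p < kc; p++) {
        __m256 b = _mm256_loadu_ps(&Bp[p * NR]);
        for (int i = 0; i < MR; i++) {
            __m256 a = _mm256_broadcast_ss(&Ap[p * MR + i]);
            acc[i] = _mm256_fmadd_ps(a, b, acc[i]); /* acc += a * b */
        }
    }
    for (int i = 0; i < MR; i++) _mm256_storeu_ps(&C[i * ldc], acc[i]);
#else
    /* Scalar fallback with the same semantics. */
    for (int p = 0; p < kc; p++)
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                C[i * ldc + j] += Ap[p * MR + i] * Bp[p * NR + j];
#endif
}
```

Compile with -mavx2 -mfma (or -march=native on a supporting CPU) to take the intrinsic path.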

Installation

Install the following packages via apt if you are using a Debian-based Linux distribution:

sudo apt-get install cmake build-essential python3-dev python3-pip libomp-dev

Create a virtual environment using venv or conda, e.g.

python3 -m venv .venv
source .venv/bin/activate

and install the Python dependencies

python -m pip install -r requirements.txt

Optional:

To benchmark OpenBLAS, start by installing it according to the installation guide. During the installation, ensure you set an appropriate TARGET and disable AVX512 instructions. For instance, if you're using Zen4/5 CPUs, compile OpenBLAS with:

make TARGET=ZEN

Otherwise, OpenBLAS defaults to AVX512 instructions available on Zen4/5 CPUs.

Performance

Test environment:

CPU: AMD Ryzen 7 9700X | Ryzen 9 7900X
Locked CPU clock frequency: 4.5GHz
RAM: 32GB DDR5 6000 MHz CL36
OpenBLAS v.0.3.26
Compiler: GCC 13.3.0
OS: Ubuntu 24.04.1 LTS

First, lock the CPU clock frequency (replace CLK with the target frequency, e.g. 4500MHz):

sudo cpupower frequency-set -u CLK
sudo cpupower frequency-set -d CLK

Ensure that the clock frequency is stable and doesn't vary during the benchmark. You can check the CPU clock frequency by running

watch -n 1 grep \"cpu MHz\" /proc/cpuinfo
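
If you would rather check the spread programmatically, the following sketch scans text in /proc/cpuinfo format and reports the lowest and highest "cpu MHz" values. The function name cpu_mhz_range is hypothetical and not part of the repository; it also assumes the usual Linux line format "cpu MHz : <value>".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scans /proc/cpuinfo-style text and stores the lowest and highest
   "cpu MHz" values in *lo and *hi. Returns the number of entries found;
   *lo and *hi are left untouched if there are none. */
int cpu_mhz_range(const char *text, double *lo, double *hi) {
    assert(text != NULL && lo != NULL && hi != NULL);
    int count = 0;
    const char *line = text;
    while (line != NULL && *line != '\0') {
        double mhz;
        /* Whitespace in the format matches any run of spaces or tabs. */
        if (sscanf(line, "cpu MHz : %lf", &mhz) == 1) {
            if (count == 0 || mhz < *lo) *lo = mhz;
            if (count == 0 || mhz > *hi) *hi = mhz;
            count++;
        }
        const char *nl = strchr(line, '\n');
        line = nl ? nl + 1 : NULL; /* advance to the next line, if any */
    }
    return count;
}
```

Read /proc/cpuinfo into a NUL-terminated buffer (for example with fopen and fread) and pass it in; a large gap between the two values suggests the clock is not locked.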

To benchmark the matmul implementation, run

cmake -B build -S . -DOPENBLAS=OFF -DNTHREADS=X
cmake --build build
./build/benchmark MINSIZE MAXSIZE NPTS WARMUP

Set -DNTHREADS according to your CPU (on the Ryzen 7 9700X I use -DNTHREADS=16). If not specified, the benchmark parameters default to MINSIZE=200, MAXSIZE=8000, NPTS=40, WARMUP=5.
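
The default parameters are consistent with a linear sweep: (8000 - 200) / (40 - 1) = 200, so one plausible reading is that the benchmark visits square sizes 200, 400, ..., 8000. The helpers below sketch that spacing, together with the usual 2*M*N*K operation count for converting a measured time into GFLOP/s. Both the linear spacing and the function names are assumptions for illustration, not taken from benchmark.c.

```c
#include <assert.h>

/* i-th matrix size of a linear sweep from minsize to maxsize with npts
   points (the spacing is an assumption, see the text above). */
int sweep_size(int minsize, int maxsize, int npts, int i) {
    assert(npts >= 1 && i >= 0 && i < npts && maxsize >= minsize);
    if (npts == 1) return minsize;
    return minsize + (int)((long long)i * (maxsize - minsize) / (npts - 1));
}

/* GFLOP/s of an (m x k) by (k x n) product: each of the m*n*k
   inner-product terms costs one multiply and one add. */
double matmul_gflops(int m, int n, int k, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * m * n * k / seconds / 1e9;
}
```

For example, a 4000 x 4000 x 4000 product that took 0.2 s would correspond to 2 * 4000^3 / 0.2 / 1e9 = 640 GFLOP/s.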

To benchmark OpenBLAS, run

cmake -B build -S . -DOPENBLAS=ON -DOPENBLAS_PATH=path/to/OpenBLAS/
cmake --build build
./build/benchmark MINSIZE MAXSIZE NPTS WARMUP

Or you can use

bash benchmark.sh /path/to/OpenBLAS NTHREADS

to benchmark both the code and OpenBLAS.

To visualize the results, run

python plot_benchmark.py

Tests

bash test.sh NTHREADS

salykova/matmul.c
