Dynamic Dispatch for Kernels + Support MKL-based kernels w/ Fallback #122

Draft · wants to merge 3 commits into main

Conversation


balbit commented Dec 5, 2024

Problem

  • All kernel implementations are consolidated in matmul.h under a monolithic MatmulOperator class
    • Contains a messy list of matmul implementations, forcing #ifdef handling in downstream clients
    • Makes individual kernels difficult to migrate
    • Fallback implementations require duplicating all matmul code (currently the case for CUDA, Metal, and MKL)
    • Only one kernel can be compiled at a time
  • The Metal kernel is missing operations, so it currently fails to build (see Metal Changes below)

Changes

Migrate MatmulOperator Structure

  • Change MatmulOperator to a virtual base class
    • Consolidate kernel selection in a factory (see the sketch after this list)
  • Corresponding Makefile modifications
  • Add a subclass for each kernel
  • Migrate callsites (in llm/ops) to use references to the dispatched operator
  • Test fallback behavior
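
A minimal sketch of the intended structure, assuming one virtual method per matmul variant and kernel selection consolidated in a single factory. The subclass names, the factory name CreateMatmulOperator, and the feature macro are illustrative placeholders, not identifiers taken from this PR; matmul_params and the bias matmul method come from the existing matmul.h.

// matmul.h (sketch): MatmulOperator as an abstract interface.
struct matmul_params;  // existing parameter struct

class MatmulOperator {
 public:
    virtual ~MatmulOperator() = default;
    // One virtual entry point per matmul variant; only one shown here.
    virtual void mat_mul_accelerator_transposed_fastover_column_bias(const matmul_params *params) = 0;
};

// One subclass per kernel, e.g. the portable AVX implementation.
class MatmulOperatorAVX : public MatmulOperator {
 public:
    void mat_mul_accelerator_transposed_fastover_column_bias(const matmul_params *params) override;
};

// Factory consolidating the kernel selection that was previously spread
// across callsites as #ifdefs. Macro and class names are placeholders.
MatmulOperator &CreateMatmulOperator() {
#ifdef USE_CUDA_KERNELS            // hypothetical macro; each kernel would get its own branch
    static MatmulOperatorCUDA op;  // hypothetical CUDA subclass, declared elsewhere
#else
    static MatmulOperatorAVX op;   // portable default / fallback
#endif
    return op;
}

Callsites in llm/ops would then obtain a reference once (MatmulOperator &op = CreateMatmulOperator();) and call the virtual methods on it, instead of hard-coding a particular implementation.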

Per-kernel migration

  • AVX
  • MKL with AVX fallback (see the fallback sketch after this list)
  • CUDA
  • Neon
  • Metal
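
A sketch of the MKL-with-AVX-fallback idea, building on the interface sketched above. Inheriting from the AVX subclass is just one possible fallback mechanism, and the names are illustrative rather than the ones used in this PR; cblas_sgemm is the standard MKL/CBLAS single-precision GEMM routine.

#include <mkl.h>  // provides cblas_sgemm

// Hypothetical MKL kernel that overrides only the operations MKL can
// accelerate; everything else resolves to the inherited AVX code, so the
// fallback does not require duplicating any matmul implementations.
class MatmulOperatorMKL : public MatmulOperatorAVX {
 public:
    void mat_mul_accelerator_transposed_fastover_column_bias(const matmul_params *params) override {
        // ... unpack params and call cblas_sgemm(...) here ...
        // (quantized / INT4 paths are not overridden and keep using AVX)
    }
};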

Metal Changes

Metal on the main branch fails to build due to missing operations. This PR:

  • Automatically downloads metal-cpp
  • Fixes the missing operations, which previously caused the link error below (a reference sketch follows the error output):
Undefined symbols for architecture arm64:
  "matmul::MatmulOperator::mat_mul_accelerator_transposed_fastover_column_bias(matmul_params const*)", referenced from:
      Linear_FP::forward(Matrix3D<float> const&, Matrix3D<float>&) in linear.o
ld: symbol(s) not found for architecture arm64
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
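
For reference, one way the missing symbol could be satisfied is with a naive portable loop, so the arm64 target links even without a native Metal kernel. This is purely illustrative: the class name, the field names (A, B, C, bias, row, column, data_ptr), and the assumed semantics (C = A·Bᵀ plus a per-column bias) are guesses about matmul_params, not taken from this PR, whose actual fix may be a proper Metal implementation.

// Hypothetical fallback body for the operation the linker reports as missing.
// MatmulOperatorMetal and all matmul_params field names below are assumptions.
void MatmulOperatorMetal::mat_mul_accelerator_transposed_fastover_column_bias(
        const matmul_params *params) {
    const auto &A = params->A, &B = params->B, &C = params->C;
    for (int i = 0; i < C.row; i++) {
        for (int j = 0; j < C.column; j++) {
            float acc = params->bias.data_ptr[j];  // per-output-column bias
            // B is stored transposed, so take dot products of rows of A with rows of B.
            for (int k = 0; k < A.column; k++)
                acc += A.data_ptr[i * A.column + k] * B.data_ptr[j * B.column + k];
            C.data_ptr[i * C.column + j] = acc;
        }
    }
}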

Instructions for MKL Setup

Testing

CUDA

(TinyChatEngine) elliotliu@hanlab-MSI-4090:~/TinyChatEngineMain/llm$ ./chat LLaMA2_7B_chat INT4 8
TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!

USER: Hi
ASSISTANT:  Hello! How may I assist you today?


Inference latency, Total time: 8.8 s, 735.3 ms/token, 1.4 token/s, 12 tokens
(TinyChatEngine) (base) elliotliu@hanlab-MSI-4090:~/TinyChatEngine/llm$ ./chat LLaMA2_7B_chat INT4 8
TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!

USER: Hi
ASSISTANT:  Hello! How may I assist you today?


Inference latency, Total time: 8.7 s, 723.3 ms/token, 1.4 token/s, 12 tokens
  • No slowdown or performance degradation for CUDA with dynamic dispatch

Neon

> ./chat
TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA_3_8B_Instruct
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!

USER: Hi
ASSISTANT:  Hello! How can I assist you today?


Inference latency, Total time: 0.7 s, 75.1 ms/token, 13.3 token/s, 9 tokens
  • No slowdown or degradation for Neon with dynamic dispatch
