[CPU] Support dynamic activation sparsity #27974

usstq · 2024-12-09T09:54:22Z

Details:

Activation sparsity exploits the fact that activations in the MLP layers of LLMs are sparse: input channels whose activations have small magnitude can be set to zero with an acceptable accuracy drop.
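As a rough illustration of the idea (not the kernel added by this PR), the sketch below zeroes activations whose magnitude falls below a threshold. The function name `sparsify_activations` and the fixed scalar `threshold` are assumptions made for this example only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only: set every activation whose
// magnitude is below `threshold` to zero and return how many channels
// ended up below the threshold.
std::size_t sparsify_activations(std::vector<float>& act, float threshold) {
    assert(threshold >= 0.0f);  // a negative threshold would never match
    std::size_t zeroed = 0;
    for (float& v : act) {
        if (std::fabs(v) < threshold) {
            v = 0.0f;
            ++zeroed;
        }
    }
    return zeroed;
}
```

The returned count divided by the number of channels gives the sparsity ratio for that token, which is the quantity that changes from one token to the next.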

The distribution of sparse activation channels is dynamic (known only at runtime) and varies a lot from token to token. The optimization opportunity therefore exists only in the 2nd-token generation process with batch size fixed to 1 (which is exactly the typical use case for client-side LLM inference), where the cost of reading the weights that correspond to skipped input channels can be saved.

The best weight memory layout for this optimization is plain [IC, OC], so that the weights corresponding to each input channel are dense and the non-sparse input channels benefit from the CPU's HW prefetcher on contiguous stream access. If we used the current blocked weight layout set by the oneDNN fork, the weights from non-sparse and sparse channels would be mixed together at cache-line granularity, which would hurt performance because of both an access pattern unfriendly to the HW prefetcher and DDR's physical page granularity.
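To make the layout argument concrete, here is a minimal scalar sketch of a batch-1 fully-connected computation over plain [IC, OC] weights. It is not the JIT kernel from this PR; the name `sparse_fc_ic_oc`, the f32 weights, and the scalar `threshold` parameter are assumptions for illustration. Each kept input channel reads one contiguous row of OC weights, and each skipped channel touches no weight memory at all.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: y[oc] = sum over ic of x[ic] * w[ic * OC + oc],
// with weights stored row-major as plain [IC, OC].
// Returns the number of weight rows that were actually read.
std::size_t sparse_fc_ic_oc(const std::vector<float>& x,
                            const std::vector<float>& w,
                            std::size_t IC,
                            std::size_t OC,
                            float threshold,
                            std::vector<float>& y) {
    assert(x.size() == IC && w.size() == IC * OC);
    y.assign(OC, 0.0f);
    std::size_t rows_read = 0;
    for (std::size_t ic = 0; ic < IC; ++ic) {
        const float a = x[ic];
        if (std::fabs(a) < threshold) {
            continue;  // sparse channel: its whole weight row is never loaded
        }
        const float* row = w.data() + ic * OC;  // OC contiguous floats
        for (std::size_t oc = 0; oc < OC; ++oc) {
            y[oc] += a * row[oc];  // streaming access over one dense row
        }
        ++rows_read;
    }
    return rows_read;
}
```

With IC = 3, OC = 2, weights {1, 2, 3, 4, 5, 6}, activations {1.0, 0.01, 2.0} and a threshold of 0.1, only rows 0 and 2 are read. In a blocked layout the same skip would save much less, because a single cache line would hold weights from several input channels.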

However, choosing the plain [IC, OC] layout poses a challenge for 1st-token latency, because the blocked layout is best for the 1st-token (compute-bound) case. This PR therefore also has to minimize the degradation of 1st-token latency.

Tickets:

CVS-148374

usstq added 9 commits October 30, 2024 16:43

add ActivationSparsityFusion

f6070c8

add activation sparse fc kernel

6f48564

update i8 impl

c2a02ab

add i4 impl

6057da2

fix int4 first-token

f581ec8

add avx general intrinsic wrapper

6bc455f

add simd abstract & AVX512 support

8ce0b42

fix AVX512 bugs

dad99c1

add reuse_B gemm kernel

99b05bc

github-actions bot added the labels category: CPU (OpenVINO CPU plugin) and category: build (OpenVINO cmake script / infra) Dec 9, 2024

usstq added 10 commits December 10, 2024 10:45

fix bug in MM_ComputeBounded_reuseB_i8

0b58da6

fix bug in reduce_outputs

a75aa9f

fix bug in avx512 int4

c131998

support sym

61f7852

support f16 weights

599bb91

support f32 activation only

2e6a8ca

replace intrinsic with jit

e700cf2

add i8 in jit_compile_accumulate_weight

d75a323

i8 is fully-jitted

a4ac58f

remove cross-compile

cd84680

github-actions bot removed the label category: build (OpenVINO cmake script / infra) Dec 18, 2024

usstq added 8 commits December 18, 2024 16:59

simplify kernel interface

b9a5265

add if_ & while_

ea77a20

add simd_jit header & do_while_

69e4467

fix bugs

7182fce

clean-up

3a4461b

Merge remote-tracking branch 'origin/master' into dynsparse

8b1cd3d

add test case

f74b44f

fix CI issues

e745f17

usstq marked this pull request as ready for review December 19, 2024 13:38

usstq requested review from a team as code owners December 19, 2024 13:38

usstq added 5 commits December 19, 2024 21:48

fix CI issue2

fa98141

fix CI issue3

dd68cb9

fix CI issue 4

2071c60

fix CI issue 5

354e4de

fix test cases

a941617

usstq requested a review from luo-cheng2021 December 27, 2024 01:36
