Add cache rotation inputs and CPU kernel implementation for cache rotation #27088

vshampor · 2024-10-16T13:04:23Z

Tickets:
153783

src/core/src/op/paged_attention.cpp

slyalin · 2024-10-28T07:17:35Z

...mon/transformations/src/transformations/sdpa_to_paged_attention/state_management_pattern.cpp

+            pa_arguments.insert(pa_arguments.begin() + 13, v0::Constant::create(element::f32, Shape{0}, {}));
+            pa_arguments.insert(pa_arguments.begin() + 14, v0::Constant::create(element::i32, Shape{0}, {}));


If you make these inputs really optional, these two lines are not required.

slyalin · 2024-10-28T07:19:57Z

src/core/src/op/paged_attention.cpp

+            get_input_partial_shape(13).rank().is_dynamic() ||
+            get_input_partial_shape(13).rank().get_length() == 0 ||
+            get_input_partial_shape(13).rank().get_length() == 1,
+            "Input `rotation_coefficients` should either have an empty shape or rank 1, but it has rank ",


Suggested change

"Input `rotation_coefficients` should either have an empty shape or rank 1, but it has rank ",

"Input `rotation_coefficients` should either have rank 1 or omitted, but it has rank ",

An "empty" shape means [0] here, which has rank 1.
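
To make the distinction in this comment concrete, here is a minimal, self-contained sketch. It does not use OpenVINO types: `Shape`, `rank_of`, `element_count` and `is_empty_placeholder` are hypothetical stand-ins introduced only for illustration. It shows that a shape of [0] has rank 1 and zero elements, while a rank-0 (scalar) shape has one element.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a static shape: one entry per dimension.
using Shape = std::vector<std::size_t>;

// Rank is the number of dimensions.
inline std::size_t rank_of(const Shape& s) {
    return s.size();
}

// Element count is the product of all dimensions (1 for a rank-0 scalar).
inline std::size_t element_count(const Shape& s) {
    return std::accumulate(s.begin(), s.end(), std::size_t{1}, std::multiplies<std::size_t>());
}

// Hypothetical helper: an input counts as "omitted" when it is a rank-1
// tensor holding zero elements, i.e. shape [0].
inline bool is_empty_placeholder(const Shape& s) {
    return rank_of(s) == 1 && element_count(s) == 0;
}
```

Under this reading, a rank check that accepts rank 1 already covers the [0] placeholder, so a separate rank-0 branch would only admit scalars.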

slyalin · 2024-10-28T07:20:39Z

src/core/src/op/paged_attention.cpp

+    NODE_VALIDATION_CHECK(
+            this,
+            get_input_partial_shape(13).rank().is_dynamic() ||
+            get_input_partial_shape(13).rank().get_length() == 0 ||


Suggested change

get_input_partial_shape(13).rank().get_length() == 0 ||

slyalin · 2024-10-28T07:23:08Z

src/core/src/op/paged_attention.cpp

+            get_input_partial_shape(14).rank().get_length() == 0 ||
+            get_input_partial_shape(14).rank().get_length() == 1,
+            "Input `rotated_block_indices` should either have an empty shape or rank 1 but it has rank ",


The same comments apply here as for input 13 above.

slyalin · 2024-10-28T07:27:03Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

@@ -1576,6 +1591,11 @@ struct AttentionExecutor : public PagedAttentionExecutor {
        if (alibi_slopes) {
            alibi_slopes.assert_dims({H});
        }
+
+        if (rotated_block_indices) {
+            // Rotation, and cache eviction, is limited to cases when Q, K and V embedding sizes are equal, e.g. S == Sv


We already have cases where they are not: minicpm-3

Removed. I realized that we don't need that limitation for cache rotation, since we only rotate the K values.

slyalin · 2024-10-28T07:30:07Z

src/plugins/intel_gpu/src/plugin/ops/paged_attention.cpp

@@ -58,6 +59,10 @@ static void CreatePagedAttentionExtensionOp(ProgramBuilder& p, const std::shared
    OPENVINO_ASSERT(alibi_const != nullptr);
    prim.has_alibi = ov::shape_size(alibi_const->get_output_shape(0)) > 0;

+    std::shared_ptr<ov::op::v0::Constant> rotation_coefficients_const = std::dynamic_pointer_cast<ov::op::v0::Constant>(op->get_input_node_shared_ptr(rotation_coefficients_idx));
+    OPENVINO_ASSERT(rotation_coefficients_const != nullptr);
+    prim.has_rotation_coefficients = ov::shape_size(alibi_const->get_output_shape(0)) > 0;


alibi_const shouldn't be used here -- bad copy&paste?

Fixed, thanks.

dmitry-gorokhov · 2024-11-15T07:52:52Z

@luo-cheng2021 Please review CPU PA changes.

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/common.hpp

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

luo-cheng2021 · 2024-11-18T06:54:51Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/cache_rotation.hpp

+                CT cache_value_1 = *cache_value_1_ptr;
+
+                *cache_value_0_ptr = cache_value_0 * rotation_value_cos - cache_value_1 * rotation_value_sin;
+                *cache_value_1_ptr = cache_value_0 * rotation_value_sin + cache_value_1 * rotation_value_cos;


Is the algorithm the same as in the following code?

openvino/src/plugins/intel_cpu/src/nodes/rope.cpp

Lines 158 to 161 in c4d6d2b

    auto src0 = src[i];
    auto src1 = src[i + half_rotary_dims];
    dst[i] = cos[i] * src0 - sin[i] * src1;
    dst[i + half_rotary_dims] = cos[i + half_rotary_dims] * src1 + sin[i + half_rotary_dims] * src0;

If so, the following code can be used as reference:

openvino/src/plugins/intel_cpu/src/nodes/rope.cpp

Lines 35 to 102 in c4d6d2b

static std::shared_ptr<kernel::JitKernelBase> createJitKernel(const jit_rotary_compile_params& param, bool check_vec_size2 = false) {
    std::shared_ptr<kernel::JitKernelBase> res;

    MAYBE_UNUSED(param);
    MAYBE_UNUSED(check_vec_size2);

#if defined(OPENVINO_ARCH_X86_64)

    if (dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx512_core)) {
        bool flag = true;
        if (check_vec_size2) {
            auto vec_size = jit_rotary_kernel<dnnl::impl::cpu::x64::avx512_core>::vec_size;
            if (param.rotary_ndims % (vec_size * 2) != 0)
                flag = false;
        }
        if (flag)
            res = std::make_shared<jit_rotary_kernel<dnnl::impl::cpu::x64::avx512_core>>(param);
    } else if (dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx2)) {
        bool flag = true;
        if (check_vec_size2) {
            auto vec_size = jit_rotary_kernel<dnnl::impl::cpu::x64::avx2>::vec_size;
            if (param.rotary_ndims % (vec_size * 2) != 0)
                flag = false;
        }
        if (flag)
            res = std::make_shared<jit_rotary_kernel<dnnl::impl::cpu::x64::avx2>>(param);
    }

    if (res)
        res->create_kernel();

#endif // OPENVINO_ARCH_X86_64

    return res;
}

static void execJitKernel(const std::shared_ptr<kernel::JitKernelBase>& ker, const void* src, void* dst, const float* cos, const float* sin) {
    MAYBE_UNUSED(ker);
    MAYBE_UNUSED(src);
    MAYBE_UNUSED(dst);
    MAYBE_UNUSED(cos);
    MAYBE_UNUSED(sin);

#if defined(OPENVINO_ARCH_X86_64)

    jit_rotary_call_args call_args;
    call_args.src = src;
    call_args.cos = cos;
    call_args.sin = sin;
    call_args.dst = dst;
    (*ker)(&call_args);

#endif // OPENVINO_ARCH_X86_64
}

template <typename T>
struct RoPE::RoPEExecutorRotateHalf : public RoPE::Executor {
    const op::internal::RoPE::Config& m_config;
    std::shared_ptr<kernel::JitKernelBase> m_rotaryKernel;

    RoPEExecutorRotateHalf(const op::internal::RoPE::Config& config) : m_config(config) {
        jit_rotary_compile_params jcp;
        jcp.src_prc = precision_of<T>::value;
        jcp.dst_prc = precision_of<T>::value;
        jcp.rotary_ndims = config.rotary_ndims;
        jcp.interleave = false;
        m_rotaryKernel = createJitKernel(jcp);
    }

I have already written and tested my implementation. Besides, the code you've sent me probably cannot be reused without modifications or bulky instantiations.
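
For readers comparing the two snippets in this thread, the shared idea is a 2D rotation applied to pairs of embedding elements. The following is a minimal scalar sketch, not the PR's kernel: the function name `rotate_token_in_place` and the coefficient layout (all cosines first, then all sines) are assumptions made for illustration. It pairs element i with element i + half, as in the quoted rope.cpp lines, and updates in place as the cache rotation code does.

```cpp
#include <cassert>
#include <cstddef>

// Rotate one token's embedding in place, pairing element i with element
// i + half (the "rotate half" layout from the quoted rope.cpp lines).
// Assumption for this sketch: cos_sin holds `half` cosines followed by
// `half` sines; the actual PR may lay out its coefficients differently.
inline void rotate_token_in_place(float* x, const float* cos_sin, std::size_t embedding_size) {
    assert(embedding_size % 2 == 0);
    const std::size_t half = embedding_size / 2;
    const float* cos_ptr = cos_sin;
    const float* sin_ptr = cos_sin + half;
    for (std::size_t i = 0; i < half; ++i) {
        // Read both inputs before writing either one: the update is in
        // place, so writing x[i] first would corrupt the second result.
        const float x0 = x[i];
        const float x1 = x[i + half];
        x[i] = x0 * cos_ptr[i] - x1 * sin_ptr[i];
        x[i + half] = x0 * sin_ptr[i] + x1 * cos_ptr[i];
    }
}
```

With cos = 0 and sin = 1 for a pair, the pair (1, 0) becomes (0, 1), which is a quarter-turn rotation.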

luo-cheng2021 · 2024-11-18T07:03:26Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/cache_rotation.hpp

+
+template<class CT>
+inline static void rotate_kv_cache_block_hw(CT* cache_block_ptr, float* block_rotation_coefficients_ptr, size_t num_heads, size_t block_size, size_t embedding_size) {
+#if !defined(HAVE_AVX2) && !defined(HAVE_AVX512F)


It would be cleaner to merge rotate_kv_cache_block_hw and rotate_kv_cache_block_sw and let rotate_kv_cache_chunk_xxx handle the tails.

I need HW and SW available as separate functions for testing purposes.

The CPU plugin's usual test style is to cover such code with layer/subgraph tests (see the sdpa sample). Since the CI infrastructure can cover avx2/avx512/amx, there is no need to create a new test structure; we'd better create a subgraph test just like the one for sdpa.
But the PagedAttention node is a little special: it has no reference implementation now, which means there is no reference result to compare against. I think the reference will be coming.
Based on the current status, I suggest removing the new test module and filing a ticket to track the follow-up.
Correct me if I'm wrong. @dmitry-gorokhov @slyalin

I personally don't mind splitting vector and scalar impls into separate functions. I would say it is even better from a code readability standpoint.

Regarding unit tests: that is actually good developer practice. The fact that we haven't implemented unit tests for intrinsics optimizations doesn't mean we should prohibit them altogether. They seem to be convenient for development purposes, so I am happy to have such an infrastructure.
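
The kind of unit test discussed here, a fast path checked against a scalar reference, can be sketched without any intrinsics. The code below is a portable stand-in under stated assumptions: `rotate_chunk_sw`, `rotate_chunk_blocked` and `kLanes` are hypothetical names, and the "blocked" function only imitates the shape of a vectorized kernel (fixed-width body, scalar tail). It is not the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>

// Scalar reference: rotate `count` (x0, x1) pairs one at a time.
inline void rotate_chunk_sw(float* x0, float* x1, const float* c, const float* s, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        const float a = x0[i];
        const float b = x1[i];
        x0[i] = a * c[i] - b * s[i];
        x1[i] = a * s[i] + b * c[i];
    }
}

// Hypothetical lane count, standing in for a SIMD register width.
constexpr std::size_t kLanes = 8;

// Stand-in for a vectorized path: handles kLanes pairs per step, then
// delegates the remaining tail to the scalar reference.
inline void rotate_chunk_blocked(float* x0, float* x1, const float* c, const float* s, std::size_t count) {
    std::size_t i = 0;
    for (; i + kLanes <= count; i += kLanes) {
        float a[kLanes];
        float b[kLanes];
        for (std::size_t l = 0; l < kLanes; ++l) {  // "load" a full block
            a[l] = x0[i + l];
            b[l] = x1[i + l];
        }
        for (std::size_t l = 0; l < kLanes; ++l) {  // compute and "store"
            x0[i + l] = a[l] * c[i + l] - b[l] * s[i + l];
            x1[i + l] = a[l] * s[i + l] + b[l] * c[i + l];
        }
    }
    rotate_chunk_sw(x0 + i, x1 + i, c + i, s + i, count - i);
}
```

A test would run both functions on identical inputs whose length is not a multiple of `kLanes`, so that the tail path is exercised, and compare the outputs element by element. Keeping the scalar function separately callable is what makes this comparison possible.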

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

...mon/transformations/src/transformations/sdpa_to_paged_attention/state_management_pattern.cpp

src/core/src/op/paged_attention.cpp

dmitry-gorokhov · 2024-12-25T05:38:08Z

src/plugins/intel_cpu/tests/unit/vectorized/paged_attn_cache_rotation.cpp

+    template <class T>
+    void test_chunk_rotation_for_type() {
+        auto instruction_set = std::get<0>(GetParam());
+        if (instruction_set == TargetInstructionSet::AVX512 && (!ov::with_cpu_x86_avx512f())) {


We need the same condition for avx2. We still officially support CPUs with the SSE ISA only (even though we don't have them in pre-commit), so we need to keep the tests green there.
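
The requested guard amounts to skipping a parameterized test whenever the host lacks the instruction set under test. Below is a minimal sketch of that decision, kept independent of OpenVINO and gtest so it compiles on its own. `HostCaps`, `detect_host_caps` and `should_skip` are hypothetical names, and the `AVX2` enumerator is assumed (only `TargetInstructionSet::AVX512` appears in the diff). The detection uses the GCC/Clang builtin `__builtin_cpu_supports`; the real test would presumably use an OpenVINO query analogous to the `ov::with_cpu_x86_avx512f()` call shown in the diff.

```cpp
#include <cassert>

// Mirrors the enum used in the test; the AVX2 enumerator is an assumption.
enum class TargetInstructionSet { AVX2, AVX512 };

// Hypothetical description of what the host CPU supports.
struct HostCaps {
    bool avx2;
    bool avx512f;
};

// Query the host at run time. Only GCC/Clang on x86 are handled here;
// everywhere else the sketch conservatively reports no support.
inline HostCaps detect_host_caps() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return HostCaps{__builtin_cpu_supports("avx2") != 0, __builtin_cpu_supports("avx512f") != 0};
#else
    return HostCaps{false, false};
#endif
}

// A test for `target` should be skipped when the host cannot execute it.
inline bool should_skip(TargetInstructionSet target, const HostCaps& host) {
    switch (target) {
    case TargetInstructionSet::AVX2:
        return !host.avx2;
    case TargetInstructionSet::AVX512:
        return !host.avx512f;
    }
    return true;  // unknown target: skip rather than crash
}
```

On an SSE-only machine both branches return true, so both vectorized test variants would be skipped and the suite stays green.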

luo-cheng2021 · 2024-12-25T06:36:06Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/common.hpp

@@ -143,20 +143,34 @@ inline void mm512_uni_storeu_tail_ps(ov::float16* addr, __m512 v, size_t count)
 }
 #endif

+#if defined(HAVE_AVX2)


github-actions bot added category: Core OpenVINO Core (aka ngraph) category: GPU OpenVINO GPU plugin category: CPU OpenVINO CPU plugin category: transformations OpenVINO Runtime library - Transformations category: CPP API OpenVINO CPP API bindings labels Oct 16, 2024

vshampor force-pushed the token_rotation branch from 2a172b2 to c071571 Compare October 26, 2024 00:52

slyalin requested changes Oct 28, 2024

github-actions bot added the category: build OpenVINO cmake script / infra label Oct 30, 2024

vshampor changed the title ~~Add cache rotation inputs~~ Add cache rotation inputs and CPU kernel implementation for cache rotation Nov 12, 2024

vshampor mentioned this pull request Nov 12, 2024

Token rotation openvinotoolkit/openvino.genai#987

Open

vshampor force-pushed the token_rotation branch from d90e212 to ed46cfe Compare November 12, 2024 20:30

dmitry-gorokhov assigned luo-cheng2021 Nov 15, 2024

luo-cheng2021 requested changes Nov 18, 2024

luo-cheng2021 reviewed Nov 18, 2024

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp (outdated, resolved)

vshampor requested a review from slyalin November 18, 2024 10:11

vshampor force-pushed the token_rotation branch from afa851c to 8a355f3 Compare November 18, 2024 18:10

github-actions bot added category: Python API OpenVINO Python bindings category: TF FE OpenVINO TensorFlow FrontEnd category: PyTorch FE OpenVINO PyTorch Frontend category: JAX FE OpenVINO JAX FrontEnd labels Nov 18, 2024

vshampor force-pushed the token_rotation branch from 8a355f3 to a33f255 Compare November 19, 2024 10:13

github-actions bot added category: CI OpenVINO public CI github_actions Pull requests that update GitHub Actions code category: NPU OpenVINO NPU plugin labels Nov 19, 2024

vshampor force-pushed the token_rotation branch from 138de47 to a0818c2 Compare November 19, 2024 12:12

vshampor requested a review from luo-cheng2021 November 19, 2024 12:20

vshampor marked this pull request as ready for review November 19, 2024 12:22

vshampor requested review from a team as code owners November 19, 2024 12:22

itikhono reviewed Dec 24, 2024

...mon/transformations/src/transformations/sdpa_to_paged_attention/state_management_pattern.cpp (resolved)

itikhono reviewed Dec 24, 2024

src/core/src/op/paged_attention.cpp (outdated, resolved)

vshampor added 22 commits December 24, 2024 12:52

Add cache rotation inputs in transformations, CPU and GPU plugins

b75fd35

Fix warnings

d3b668c

Remove diag message

9d7bf14

Fix more warnings

27ef9b3

Compile stub test if not on x86

b60ee77

Use 16-core executor

13ca2bf

Fix Android builds

dd8a73d

Fix type prop test

caa5b10

Move cache rotation before current KV entry copy

0b600f6

Apply comments

d93c8d5

Fix build on Win

d4b03b1

Fix PA build

a533b8c

Add NeoX?

0f6fb65

Refactor to delta + LUT

6b85c17

Format

3618d49

Debug prints

feb7186

Add shape to rotation deltas

32d4180

Add test

d6af7d1

Remove debug prints

d9518b8

Format

38c2ef7

Fix PA tests

51f3892

Fix typo

86d281f

vshampor force-pushed the token_rotation branch from 7777b4e to 86d281f Compare December 24, 2024 11:52

dmitry-gorokhov reviewed Dec 25, 2024

luo-cheng2021 approved these changes Dec 25, 2024

wenjiew added this to the 2025.0 milestone Dec 26, 2024

vshampor added 2 commits December 27, 2024 13:00

Skip AVX2 vectorized test if no AVX2 on host

9781330

Remove extra ifdef

8b7f5ae
