Token rotation #987
base: master
Conversation
src/cpp/src/cache_eviction.cpp (outdated diff)
block_rotation_data.cosines.push_back(
    m_rope_cos_lut[current_rotation_delta_in_blocks * m_block_size]);
block_rotation_data.sines.push_back(
    m_rope_sin_lut[current_rotation_delta_in_blocks * m_block_size]);
Isn't it better to just pass m_rope_cos_lut and m_rope_sin_lut to the PagedAttention op along with the required indices? Having this packing on the host side doesn't provide any flexibility in how you can adjust RoPE -- the code here essentially implements Gather. Potential RoPE variants that may be implemented in the future would require changes in how we prepare sin/cos or in how we apply the coefficients in the PA operation itself. But doing the coefficient packing here leads to potential data duplication across multiple LLM layers, and we will need this packing each time re-rotation is applied, so it looks better to locate this code right in PA and gather the values locally for each KV cache block without repacking.
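For illustration, a minimal sketch (types and names hypothetical, not from this PR) of what the host-side packing amounts to, namely a plain gather of LUT rows that the PA op could just as well do internally given the LUTs and per-block row indices:

#include <cstddef>
#include <vector>

// Hypothetical sketch: the packing above is equivalent to gathering LUT rows.
// If the PagedAttention op instead received the full cos/sin LUTs plus one row
// index per rotated block, it could perform this lookup internally.
std::vector<std::vector<float>> gather_lut_rows(const std::vector<std::vector<float>>& lut,
                                                const std::vector<size_t>& row_indices) {
    std::vector<std::vector<float>> packed;
    packed.reserve(row_indices.size());
    for (size_t idx : row_indices) {
        packed.push_back(lut[idx]);  // duplicates the same LUT row per block and per layer
    }
    return packed;
}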
It could be done, I suppose, with a minor question of whether in this case there should really be separate sin and cos inputs.
Not "potential data duplication" -- chunks of cos/sin coefficients are explicitly duplicated here for all tokens in each block. When I mentioned "potential" I meant duplication of rotation coefficients between different attention layers in the model.
BTW, we are working with the double data type here, so the data duplication gives a 2x memory consumption impact. It will not contribute to the memory peak since we call this function per layer, but using double here is still not necessary.
Changed the base data type to float.
rotation_multipliers_tensor_data[position_offset + embedding_pair_idx] =
    rotation_multipliers_cos[tok_idx][embedding_pair_idx];
rotation_multipliers_tensor_data[position_offset + embedding_pair_idx + head_size / 2] =
    rotation_multipliers_sin[tok_idx][embedding_pair_idx];
Here we are repacking the sin/cos coefficients again after the initial copying/duplication from the LUT tables. The repacking is required only because we expect this particular layout in the PA operation -- nothing else forces us to have it here. But is it really a convenient layout if we need repacking? In the CPU-side cache eviction code, for example here https://github.com/openvinotoolkit/openvino/pull/27088/files#diff-9ec1d0710e07c208e40402e8342b86075f2bdc06a900f1a8cc3ec28385952753R17-R18 we access sin and cos through two unrelated pointers, so packing sin and cos next to each other for each token doesn't look like a requirement. If the PA operation took sin and cos separately, in their original layout, the repacking here wouldn't be needed and it would still be friendly to the CPU implementation. Correct?
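As a reference point, a hedged sketch (names hypothetical) of the layout the snippet above produces -- cos in the first half of each token's row, sin in the second -- next to the alternative of keeping cos and sin as two separate, unrepacked tensors:

#include <cstddef>
#include <vector>

// Hypothetical sketch of the PA-expected layout produced above: for each token,
// head_size/2 cos values followed by head_size/2 sin values in one contiguous row.
void pack_interleaved(std::vector<float>& dst, size_t head_size,
                      const std::vector<std::vector<float>>& cos_rows,
                      const std::vector<std::vector<float>>& sin_rows) {
    const size_t num_pairs = head_size / 2;
    dst.assign(cos_rows.size() * head_size, 0.0f);
    for (size_t tok = 0; tok < cos_rows.size(); ++tok) {
        for (size_t p = 0; p < num_pairs; ++p) {
            dst[tok * head_size + p] = cos_rows[tok][p];
            dst[tok * head_size + num_pairs + p] = sin_rows[tok][p];
        }
    }
}
// The alternative raised in the review is to skip this step entirely and pass
// cos_rows and sin_rows to PA as two separate tensors in their original layout,
// read through two independent pointers as the CPU eviction code already does.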
By putting the potential repacking functionality at the GenAI lib level I intended to sidestep the potential future issue of devices requiring different layouts for efficiency. Already, it seems, the GPU requires a different key cache layout from the CPU case, see:
openvino.genai/src/cpp/src/device_config.hpp, line 118 at e2fa0d0:

if (m_device.find("GPU") != std::string::npos) {
Right now the difference is only in the order of the last two dims (block_size, embedding_size). It could be made non-existent, though, if we decide to forego the per-token rotation coefficient passing and only pass per-block coefficients, in which case the LUT shape should be the same both for CPU and GPU.
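A small hedged illustration of that point (shapes are assumptions for the sketch, not taken from this PR):

#include <array>
#include <cstddef>

// Hypothetical shapes: with per-token coefficients the rotation tensor carries a
// block_size-dependent dimension, so it is entangled with the device's key cache
// layout; with per-block coefficients that dimension disappears and the shape can
// be identical for CPU and GPU.
std::array<size_t, 2> rotation_coefficients_shape(size_t num_rotated_blocks,
                                                  size_t block_size,
                                                  size_t head_size,
                                                  bool per_token) {
    if (per_token) {
        return {num_rotated_blocks * block_size, head_size};
    }
    return {num_rotated_blocks, head_size};
}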
Together with the per-token duplication this requires a lot more storage and memory bandwidth to prepare and use these coefficients than just indices would. For example, if we have num_kv_heads == 1, the amount of memory spent on the coefficients is the same as the rotated KV cache blocks themselves, and if the KV cache is compressed, the coefficients, if still stored as f32, will be at least 2x bigger. In this part of the code we are just repacking that large amount of data. So we are loading/storing this amount 4 times: first when duplicating it in another place in the code, then another 2 times here when loading and storing it in a different layout, and then in PA when applying the coefficients. In the first and second passes we are actually dealing with the double type, which doubles the bandwidth requirement, BTW.
Made the LUT slower with suggestions.
Slower?
smaller*
if (current_rotation_delta_in_blocks != 0) {
    BlockRotationData block_rotation_data;
    block_rotation_data.logical_block_idx = logical_block_idx - current_rotation_delta_in_blocks;
    block_rotation_data.rotation_delta = current_rotation_delta_in_blocks * m_block_size;
So the number of different rotation_delta values is bounded by the maximum number of evicted blocks, and each rotation_delta is a multiple of m_block_size. That means we don't need to pre-calculate and store the other values in the sin/cos LUTs, which gives a significantly lower bound on their size.
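A hedged sketch of this idea (function and parameter names are hypothetical), building one LUT row per block of displacement instead of one per token position, using the same angle formula as the LUT-construction code later in this thread:

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: every rotation delta is a whole number of blocks, so the
// LUT only needs one row per block of displacement, bounded by the maximum number
// of blocks that can ever be evicted (the sin table is built analogously, with the
// minus sign for the inverse rotation).
std::vector<std::vector<float>> build_block_granular_cos_lut(size_t max_evictable_blocks,
                                                             size_t block_size,
                                                             size_t kv_head_size,
                                                             double rope_theta) {
    const size_t num_freqs = kv_head_size / 2;
    std::vector<std::vector<float>> lut(max_evictable_blocks);
    for (size_t delta_in_blocks = 0; delta_in_blocks < max_evictable_blocks; ++delta_in_blocks) {
        const size_t delta_in_tokens = delta_in_blocks * block_size;  // the only deltas ever queried
        lut[delta_in_blocks].reserve(num_freqs);
        for (size_t j = 0; j < num_freqs; ++j) {
            double inv_freq = std::pow(rope_theta, -static_cast<double>(2 * j) / kv_head_size);
            lut[delta_in_blocks].push_back(static_cast<float>(std::cos(delta_in_tokens * inv_freq)));
        }
    }
    return lut;
}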
Implemented the idea, adjusting for reality (recent_size also counts).
Had to revert this back to the maximum sequence length because in the latest iteration, in an effort to align with the Python POC, I only evict once the prompt is pre-filled, which means that the cache occupancy of a single sequence is still bounded only by the max sequence length.
}

// TODO (vshampor): LUT size equal to max cache size in tokens
// is overkill - find a way to pass the max sequence length defined by pipeline instead
Even the max sequence length is overkill. As mentioned in another comment, the number of required distinct rows in the LUT can be divided by the block size. That makes the tables block_size times smaller and provides a cheaper way to extend the tables on-the-fly when the maximum relative re-rotation distance exceeds the currently allocated value. Or maybe I just have a wrong impression of how it works.
@AlexKoff88 suggests keeping the possibility of per-token rotation for the future, hence the duplication and LUT granularity right now.
Having an index per token still gives you the possibility to control rotation at token granularity -- and this could be kept in the PA semantics. But you can already organize the LUT without unused gaps and divide all indices by the block size -- that is part of GenAI and doesn't affect how the OV side is implemented. Are there any other dependencies or future work activities that can utilize per-token granularity without changing the code in this PR? Can we externally change the cache eviction algorithm while keeping the same released GenAI package?
Concerning per-block versus per-token granularity, you can utilize well-defined broadcast semantics for the PA input that defines the rotation deltas. For example, a [num_rot_blocks, block_size] tensor shape would give per-token granularity, while [num_rot_blocks, 1] would broadcast the same rotation index to all tokens in a block, giving per-block granularity. This is a usual thing for element-wise operations in OV. This way you get an economical representation for per-block granularity and at the same time keep the flexibility to switch to per-token granularity if it is required in the future.
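A minimal sketch of that broadcast convention (all names hypothetical), resolving the delta for a given token whether the tensor's last dimension is block_size or 1:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: deltas_data is the flattened rotation-deltas tensor of shape
// [num_rot_blocks, deltas_last_dim], where deltas_last_dim is either block_size
// (per-token granularity) or 1 (the same delta broadcast to every token in a block).
size_t delta_for_token(const std::vector<int64_t>& deltas_data,
                       size_t deltas_last_dim,
                       size_t rotated_block_idx,
                       size_t token_in_block) {
    const size_t col = (deltas_last_dim == 1) ? 0 : token_in_block;  // broadcast on the last dim
    return static_cast<size_t>(deltas_data[rotated_block_idx * deltas_last_dim + col]);
}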
Added a dimension to the deltas tensor, with broadcast semantics hand-coded.
src/cpp/src/cache_eviction.cpp (outdated diff)
m_rope_sin_lut.resize(max_position_angle_multiplier);
m_rope_cos_lut.resize(max_position_angle_multiplier);

for (size_t i = 0; i < max_position_angle_multiplier; i++) {
    m_rope_sin_lut[i].reserve(num_freqs);
    m_rope_cos_lut[i].reserve(num_freqs);
    for (size_t j = 0; j < num_freqs; j++) {
        double exponent = -static_cast<double>(2 * j) / kv_head_size;
        double base_angle = std::pow(rope_theta, exponent);
        m_rope_sin_lut[i].push_back(
            -std::sin(i * base_angle));  // minus since we will be rotating by an inverse angle
        m_rope_cos_lut[i].push_back(std::cos(i * base_angle));
    }
}
Based on the other comments about the maximum number of different re-rotation values: the amount of data can be reduced by a factor of block_size and bounded by the maximum number of evicted blocks.
And storing double values still doesn't make sense to me.
Changed the data type to float; I want to keep per-token granularity even if it means duplication of values right now.
How can an external user utilize the per-token granularity without changing the cache eviction code?
Ticket: 153791
To be merged after:
openvinotoolkit/openvino#27088