Add an option to drop the request #732
Conversation
Is it possible to cover such a scenario with tests?
```cpp
    GenerationOutputs back();
    // Reads result of a generation for single iteration
    GenerationOutputs read();
    // Reads all generated tokens for all sequences
    std::vector<GenerationOutput> read_all();
};
```

```diff
-using GenerationHandle = std::unique_ptr<GenerationHandleImpl>;
+using GenerationHandle = std::shared_ptr<GenerationHandleImpl>;
```
With that change we can have multiple handles pointing to a single stream, and we give an explicit option to drop the generation via a method call. This means that one handle can call drop and invalidate not only itself but potentially other handles as well.
I think that if we go this way we should block any calls on a handle that has been dropped (throw errors, for example).
I think it is safe to say that the current approach with unique_ptr did not stop GenAI API users from misusing the handle; one could simply take a reference to the handle and do whatever. Changing to shared_ptr and exposing an explicit drop() method gives more flexibility, which OVMS needed: dropping the request in the HTTP client disconnection callback (see the OVMS pull request). Now multiple threads can use the handle (the HTTP thread and the mediapipe thread), and the generation is dropped once all shared references to the handle are gone.
I also agree with you that we could verify that nobody calls the read/read_all methods after the handle is dropped. Will add that.
```cpp
// Notify the last time even if there will be no results
// This causes read_all() to unblock in all situations
request->notify_handle();
```
I don't think we can do it that way. The idea is that we should have only one notification per step() (if any). Here we add another call, and when called in these circumstances, notify_handle() will always send tokens to the handle, so we can end up sending the same results twice. For example, when generation finishes, the sampler notifies the handle, and then this cleanup method notifies it again. In the streaming scenario, if the generation handle doesn't read() a token and check the status between these calls, it will read the last token twice.
I have removed the notify_handle call for the out-of-memory error case. Now it is not duplicated; those requests will be notified in _free_non_running_requests.
Separated the empty push to dropped handles into a separate step in the stepping method.
4654360 to eba3844
```
@@ -60,6 +60,15 @@ class ContinuousBatchingPipeline::Impl {
    ChatHistory m_history;

    void _notify_requests_dropped_by_handle() {
        // Notify the last time by pushing empty output
        // This causes read_all() to unblock by adding anything to the queue
```
```diff
-// This causes read_all() to unblock by adding anything to the queue
+// This causes read() to unblock when called before status change by adding anything to the queue
```
done
```cpp
void GenerationHandleImpl::drop() {
    m_generation_stream->drop();
}

std::unordered_map<uint64_t, GenerationOutput> GenerationHandleImpl::back() {
```
I think we should block all methods, not just read.
done
src/cpp/src/generation_handle.cpp
Outdated
```cpp
std::unordered_map<uint64_t, GenerationOutput> GenerationHandleImpl::back() {
    return m_generation_stream->back();
}

std::unordered_map<uint64_t, GenerationOutput> GenerationHandleImpl::read() {
    OPENVINO_ASSERT(!is_dropped(), "Read cannot be called while underlying GenerationStream is already in dropped by handle state.");
```
```diff
-OPENVINO_ASSERT(!is_dropped(), "Read cannot be called while underlying GenerationStream is already in dropped by handle state.");
+OPENVINO_ASSERT(!is_dropped(), "GenerationHandle cannot be used after it is dropped.");
```
done
I don't think we can test this at the unit-test level; we would need to work with real models.
You can use real models in our tests - we use them in a lot of our tests.
Real models are used in the Python tests, so we would need to expose GenerationHandle via Python so that new tests can use the lower-level API, since we can't test this when running the pipeline via generate() calls.
```
@@ -60,6 +60,15 @@ class ContinuousBatchingPipeline::Impl {
    ChatHistory m_history;

    void _notify_requests_dropped_by_handle() {
        // Notify the last time by pushing empty output
        // This causes read() to unblock by adding anything to the queue
```
Even with this comment I don't fully understand why we need to send empty outputs.
If the handle is dropped by the user, the user should not expect any outputs from this request/handle.
…d streaming) (#2610)
* Patch tensorflow net_http to allow for installing client disconnection callbacks
* Use new genai, add tests, fix building without mediapipe, disconnect unary as well
* Tests CVS-148134
Modifications to GenAI: openvinotoolkit/openvino.genai#732
This enables dropping the user request when the client disconnects (when used in OVMS).
OVMS commit using this:
openvinotoolkit/model_server#2610