
Add an option to drop the request #732

Merged
merged 9 commits into master on Aug 6, 2024

Conversation

dkalinowski
Collaborator

This enables dropping a user request in case the client is disconnected (when used in OVMS).

OVMS commit using this:
openvinotoolkit/model_server#2610

@ilya-lavrenov
Contributor

Is it possible to cover such a scenario with tests?

@ilya-lavrenov ilya-lavrenov added this to the 2024.4 milestone Aug 2, 2024
GenerationOutputs back();
// Reads result of a generation for single iteration
GenerationOutputs read();
// Reads all generated tokens for all sequences
std::vector<GenerationOutput> read_all();
};

- using GenerationHandle = std::unique_ptr<GenerationHandleImpl>;
+ using GenerationHandle = std::shared_ptr<GenerationHandleImpl>;
Collaborator

With this change we can have multiple handles pointing to a single stream, and we give an explicit option to drop generation via a method call. This means that one handle can call drop and invalidate the stream not only for itself but potentially for other handles as well.

I think that if we go this way we should block any calls on a handle that has been dropped (throw errors, for example).

Collaborator Author

dkalinowski commented Aug 5, 2024

I think it is safe to say that the current approach with unique_ptr did not prevent GenAI API users from misusing the handle; one could simply take a reference to the handle and do whatever.

Changing to shared_ptr and exposing an explicit drop() method gives the extra flexibility OVMS needed: dropping the request in the HTTP client disconnection callback (see the OVMS pull request). Now multiple threads can use the handle (the HTTP thread and the mediapipe thread), and the generation is dropped once all shared references to the handle are gone.

I also agree with you that we should verify nobody calls the read/read_all methods after the handle is dropped. Will add that.
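The scheme discussed above can be sketched roughly as follows. All names here (StreamSketch, HandleSketch) are illustrative assumptions, not the actual GenAI types: several owners share one handle via shared_ptr, an explicit drop() invalidates the underlying stream for all of them, and subsequent reads throw.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical sketch of the shared-handle pattern: the stream carries the
// dropped flag, and every handle that shares it observes the drop.
class StreamSketch {
public:
    void drop() { m_dropped = true; }
    bool dropped() const { return m_dropped; }
private:
    bool m_dropped = false;
};

class HandleSketch {
public:
    explicit HandleSketch(std::shared_ptr<StreamSketch> stream)
        : m_stream(std::move(stream)) {}

    void drop() { m_stream->drop(); }
    bool is_dropped() const { return m_stream->dropped(); }

    // Reads are blocked once the stream has been dropped by any handle.
    std::string read() {
        if (is_dropped())
            throw std::runtime_error("GenerationHandle cannot be used after it is dropped.");
        return "token";
    }
private:
    std::shared_ptr<StreamSketch> m_stream;
};

// Analogue of the PR's change from unique_ptr to shared_ptr ownership.
using SharedHandle = std::shared_ptr<HandleSketch>;
```

Dropping through one owner (say, the HTTP disconnect callback) makes every other owner's reads fail fast instead of blocking on a client that is gone.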

src/cpp/src/sequence_group.hpp (outdated; resolved)
Comment on lines 68 to 70
// Notify the last time even if there will be no results
// This causes read_all() to unblock in all situations
request->notify_handle();
Collaborator

I don't think we can do it that way. The idea is that we should have at most one notification per step() (if any). Here we add another call, and when called in these circumstances notify_handle() will always send tokens to the handle, so we can end up sending the same results twice. For example, when generation finishes, it notifies the handle in the sampler and then again here in the cleanup method. In the streaming scenario, if the generation handle doesn't read() the token and check the status between these calls, it will read the last token twice.

Collaborator Author

I have removed the notify_handle() call in the out-of-memory error case. Now it is not duplicated; those requests will be notified in _free_non_running_requests.

Collaborator Author

Separated the empty push to dropped handles into a separate stage of the step() method.
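The separation described here can be sketched roughly like this (all names are assumptions, not the real pipeline code): the normal sampling notifications and the final notification for dropped requests run as distinct stages, so no request is notified twice within one step.

```cpp
#include <vector>

// Illustrative sketch of keeping "notify dropped handles" as its own stage,
// so each request gets at most one notification per step.
struct RequestSketch {
    bool dropped = false;
    int notifications = 0;
    void notify() { ++notifications; }
};

void step_sketch(std::vector<RequestSketch>& requests) {
    // Stage 1: regular sampling notifications, live requests only.
    for (auto& r : requests)
        if (!r.dropped)
            r.notify();
    // Stage 2: one final (empty-output) notification for dropped requests,
    // to unblock any reader still waiting on them.
    for (auto& r : requests)
        if (r.dropped)
            r.notify();
}
```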

src/cpp/src/sequence_group.hpp (outdated; resolved)
src/cpp/src/sequence_group.hpp (outdated; resolved)
@dkalinowski dkalinowski force-pushed the test4 branch 2 times, most recently from 4654360 to eba3844 Compare August 6, 2024 11:35
@@ -60,6 +60,15 @@ class ContinuousBatchingPipeline::Impl {
ChatHistory m_history;


void _notify_requests_dropped_by_handle() {
// Notify the last time by pushing empty output
// This causes read_all() to unblock by adding anything to the queue
Collaborator

Suggested change
- // This causes read_all() to unblock by adding anything to the queue
+ // This causes read() to unblock when called before status change by adding anything to the queue

Collaborator Author

done

void GenerationHandleImpl::drop() {
m_generation_stream->drop();
}

std::unordered_map<uint64_t, GenerationOutput> GenerationHandleImpl::back() {
Collaborator

I think we should block all methods, not just read.

Collaborator Author

done

std::unordered_map<uint64_t, GenerationOutput> GenerationHandleImpl::back() {
return m_generation_stream->back();
}

std::unordered_map<uint64_t, GenerationOutput> GenerationHandleImpl::read() {
OPENVINO_ASSERT(!is_dropped(), "Read cannot be called while underlying GenerationStream is already in dropped by handle state.");
Collaborator

Suggested change
- OPENVINO_ASSERT(!is_dropped(), "Read cannot be called while underlying GenerationStream is already in dropped by handle state.");
+ OPENVINO_ASSERT(!is_dropped(), "GenerationHandle cannot be used after it is dropped.");

Collaborator Author

done

@dkalinowski
Collaborator Author

@ilya-lavrenov

is it possible to cover such scenario with tests ?

I don't think we can test it at the unit-test level; we would need to work with real models.
For functional tests (tests/python_tests), the drop API (and GenerationHandle in general) is not exposed.
I'm currently adding unit tests within OVMS to test disconnections: openvinotoolkit/model_server#2610

@ilya-lavrenov
Contributor


You can use real models in our tests; we use them in a lot of our tests.

@mzegla
Collaborator

mzegla commented Aug 6, 2024

Real models are used in the Python tests, so we would need to expose GenerationHandle via Python so that new tests can use the lower-level API, since we can't test this when running the pipeline via generate() calls.

@@ -60,6 +60,15 @@ class ContinuousBatchingPipeline::Impl {
ChatHistory m_history;


void _notify_requests_dropped_by_handle() {
// Notify the last time by pushing empty output
// This causes read() to unblock by adding anything to the queue
Contributor

ilya-lavrenov commented Aug 6, 2024

Even with this comment I don't fully understand why we need to send empty outputs.
If the handle is dropped by the user, the user should not expect any empty outputs from this request / handle.

Collaborator

It's additional protection in multithreading scenarios. When the generation handle is dropped from thread #1 while thread #2 is blocked on read(), this operation unblocks it. That's the case in the model server, where the handle drop is called from a callback triggered by the HTTP server on client disconnect.
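The unblocking mechanism being discussed can be sketched with a minimal blocking queue (an illustrative assumption, not the actual GenerationStream implementation): a reader blocked on the queue's condition variable only wakes when something is pushed, which is why the pipeline pushes one empty output to a dropped request.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative sketch: read() blocks until an item is available, so dropping
// a request must push something (even an empty output) to wake the reader.
class BlockingQueueSketch {
public:
    void push(std::vector<int> item) {
        {
            std::lock_guard<std::mutex> lk(m_mutex);
            m_queue.push(std::move(item));
        }
        m_cv.notify_one();
    }

    // Blocks until the queue is non-empty, like a blocked read() call.
    std::vector<int> pull() {
        std::unique_lock<std::mutex> lk(m_mutex);
        m_cv.wait(lk, [this] { return !m_queue.empty(); });
        auto item = std::move(m_queue.front());
        m_queue.pop();
        return item;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::vector<int>> m_queue;
};
```

Without the empty push, a thread blocked in pull() would wait forever after the producer side stops generating for a dropped request.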

@ilya-lavrenov ilya-lavrenov added this pull request to the merge queue Aug 6, 2024
Merged via the queue into openvinotoolkit:master with commit 0be2620 Aug 6, 2024
33 checks passed
dkalinowski added a commit to openvinotoolkit/model_server that referenced this pull request Aug 9, 2024
…d streaming) (#2610)

* Patch tensorflow net_http to allow for installing client disconnection callbacks
* Use new genai, add tests, fix building without mediapipe, disconnect unary as well
* Tests

CVS-148134

Modifications to GenAI:
openvinotoolkit/openvino.genai#732
@ilya-lavrenov ilya-lavrenov added the category: continuous batching Continuous batching label Oct 15, 2024