pubsub: add batch timeouts #3383
base: master
Conversation
What problem are you seeing today that you're trying to solve? I.e., why do you want to set `MinBatchSize`?
@vangent our throughput is quite high: thousands to tens of thousands of messages per second, tens to hundreds per batch, per container. The problem lies with graceful shutdown and the distribution of work. Without a timeout, the concern is that the min batch size may never be met during shutdown, leaving messages unflushed. A timeout ensures those messages still get sent; we can extend the shutdown grace period by at least the timeout. A timeout is also a guarantee of minimum QoS when unpredictable issues arise.
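To make the grace-period point concrete, here is a minimal sketch (not part of this PR) of how a caller might budget its shutdown deadline. `shutdownWithGrace`, `gracePeriod`, and `batchTimeout` are hypothetical names, and the batch timeout itself is the option proposed here; only `Topic.Shutdown` is an existing `gocloud.dev/pubsub` call.

```go
package example

import (
	"context"
	"log"
	"time"

	"gocloud.dev/pubsub"
)

// shutdownWithGrace drains a topic before exit. The deadline is the usual
// grace period plus the proposed batch timeout, so a partially filled batch
// has a chance to be flushed before the context expires.
func shutdownWithGrace(topic *pubsub.Topic, gracePeriod, batchTimeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), gracePeriod+batchTimeout)
	defer cancel()
	if err := topic.Shutdown(ctx); err != nil {
		log.Printf("pubsub topic shutdown: %v", err)
		return err
	}
	return nil
}
```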
Interesting, thanks for the description. What do you hope to improve by increasing the `MinBatchSize`?

Context: this proposal would add a fair amount of complexity to an already-quite-complicated system, so I'd prefer to avoid it if possible. Every time I touch this code I have to page in a bunch of context and run tests and benchmarks, and there are often surprising side effects in various edge cases. Flushing on Shutdown seems like it might be a simpler approach...?
I thought about flushing on shutdown and had a note about it in the comments/PR (or so I thought)! Totally understand wanting to keep things simple, too; that's my approach wherever possible :)

More context: with Google Pub/Sub we're sometimes seeing context timeouts directly in the gRPC layer, even when passing fresh contexts with no timeout/deadline into publish. Google Pub/Sub's accept latency is ~3ms, so we suspect we're sending too many requests and not batching enough. This is corroborated by Pub/Sub's dashboards: we see thousands of RPS, and batches are smaller than we'd expect for our throughput (again, many thousands of messages per second, yet batch sizes are sometimes single digits).

I was hoping to use minimum batch sizes of, say, 10 or 20 to increase batch sizes and throughput at the cost of some latency while still being "safe". Ideally, timeouts would let us do this while capping the maximum latency we introduce, ensuring decent QoS.

Definitely in agreement that shutdown should flush batches; I'll follow up with that in a separate PR. All of that considered, I totally understand the added complexity. What are your thoughts on the QoS/latency guarantees and the overall context added here?
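For concreteness, here is a rough sketch of the kind of configuration described above, assuming the exported `MaxHandlers`, `MinBatchSize`, and `MaxBatchSize` fields of `gocloud.dev/pubsub/batcher.Options`; the commented-out `BatchTimeout` field is hypothetical and stands in for the option this PR proposes.

```go
package example

import "gocloud.dev/pubsub/batcher"

// Illustrative only: trade a little publish latency for larger batches.
// MinBatchSize pushes the batcher to accumulate more messages per RPC,
// while a timeout would cap how long a partially filled batch can wait.
var opts = batcher.Options{
	MaxHandlers:  10,   // concurrent batch sends
	MinBatchSize: 20,   // wait for at least 20 messages per batch...
	MaxBatchSize: 1000, // ...and never send more than this
	// BatchTimeout: 50 * time.Millisecond, // hypothetical: never hold a batch longer than this
}
```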
…n't met

This PR ensures that the batcher flushes on shutdown, even if fewer messages are pending than the specified min batch size. Sending events is preferred to dropping them, even if limits are not obeyed.
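As a minimal illustration (not the actual patch), the rule being described amounts to a decision like the following; the function and parameter names are made up for the sketch.

```go
package example

// shouldSend sketches the flush-on-shutdown rule: during normal operation a
// batch waits until minBatchSize is reached, but once shutdown has begun any
// pending messages are sent rather than dropped.
func shouldSend(pending, minBatchSize int, shuttingDown bool) bool {
	if pending == 0 {
		return false
	}
	if shuttingDown {
		return true // prefer sending a short batch to losing messages
	}
	return pending >= minBatchSize
}
```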
@vangent here's a flush-on-shutdown PR with tests: #3386.

Edit: I've rebased this PR on top of the flush-on-shutdown PR to clean up the code. In doing so, I've also simplified some of the logic and branching for ease of understanding; no one wants huge if statements and complex branching. This PR now contains tests as well.

My personal preference (FWIW, as a non-maintainer of course) is that both land: minimum latency guarantees, plus flushes for safety. Of course, that's easy for me to say :)
This PR adds a timeout option to the batcher. The timeout specifies how long the batcher waits before sending batched messages, even if the number of pending messages is under `MinBatchSize`. This guarantees delivery within a bounded time when throughput is low relative to a higher min batch size (a rough sketch of the semantics follows the to-do list below).

To do:

- [x] Implement batch timeouts in options
- [x] Only allow one concurrent request to check for timeouts
- [x] Handle timeouts when grabbing the next batch
- [ ] Tests
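The sketch below illustrates the intended semantics (it is not the code in this PR): a batch below `MinBatchSize` is held back until either enough messages arrive or a timeout measured from the oldest pending message elapses. All names are hypothetical.

```go
package example

import "time"

// readyToSend sketches the proposed timeout semantics: a batch smaller than
// minBatchSize is normally held back, but once the oldest pending message
// has waited longer than timeout, the batch is sent anyway so low-throughput
// topics still get a delivery guarantee within a bounded time.
func readyToSend(pending int, oldest time.Time, minBatchSize int, timeout time.Duration) bool {
	if pending == 0 {
		return false
	}
	if pending >= minBatchSize {
		return true
	}
	return timeout > 0 && time.Since(oldest) >= timeout
}
```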
I think the flushing PR looks close to good. I'm holding off on reviewing this one until the open questions there are addressed and it is submitted.
Let's get the "flush" PR merged first.
This PR adds a timeout option to the batcher. The timeout specifies how long the batcher waits before sending batched messages, even if the number of pending messages is under `MinBatchSize`.
This ensures delivery within a given time period when there are low-throughput messages with a min batch size that may not be met often.
We'd like to set the minimum batch size, but have concerns that messages may not be sent if the minimum isn't met. Another consideration may be to flush batches on shutdown.
This PR is the barebones concept for batch timeouts, with a minimal, untested plan. If it looks good to the gocloud team, we'll continue working on it and add tests for upstream.
What are your thoughts?
EDIT: This is now a PR on top of #3386.
To do:

- [x] Implement batch timeouts in options
- [x] Only allow one concurrent request to check for timeouts
- [x] Handle timeouts when grabbing the next batch
- [ ] Tests