pubsub: add batch timeouts #3383
base: master
Conversation
What problem are you seeing today that you're trying to solve? I.e., why do you want to set `MinBatchSize`?
@vangent our throughput is quite high: thousands to tens of thousands of messages per second, tens to hundreds per batch, per container. The problem lies with graceful shutdown and the distribution of work. Without a timeout, the concern is that the min batch size may never be met during shutdown, leaving messages unflushed. A timeout ensures those messages still get sent; we can extend the shutdown grace period by at least the timeout. A timeout is also a guarantee of minimum QoS when unpredictable issues arise.
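To make the grace-period point concrete, here is a minimal sketch (not part of this PR) of how a caller might budget its shutdown deadline. `shutdownWithGrace`, `gracePeriod`, and `batchTimeout` are hypothetical names, and the batch timeout itself is the option proposed here; only `Topic.Shutdown` is an existing `gocloud.dev/pubsub` call.

```go
package example

import (
	"context"
	"log"
	"time"

	"gocloud.dev/pubsub"
)

// shutdownWithGrace drains a topic before exit. The deadline is the usual
// grace period plus the proposed batch timeout, so a partially filled batch
// has a chance to be flushed before the context expires.
func shutdownWithGrace(topic *pubsub.Topic, gracePeriod, batchTimeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), gracePeriod+batchTimeout)
	defer cancel()
	if err := topic.Shutdown(ctx); err != nil {
		log.Printf("pubsub topic shutdown: %v", err)
		return err
	}
	return nil
}
```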
Interesting, thanks for the description. What do you hope to improve by increasing the `MinBatchSize`?

Context: this proposal would add a fair amount of complexity to an already-quite-complicated system, so I'd prefer to avoid it if possible. Every time I touch this code I have to page in a bunch of context and run tests and benchmarks, and there are often surprising side effects in various edge cases. Flushing on Shutdown seems like it might be a simpler approach...?
I thought about flushing on shutdown and had a note about it in the comments/PR (or so I thought)! Totally understand wanting to keep things simple, too; that's my approach wherever possible :)

More context: with Google Pub/Sub we're sometimes seeing context timeouts directly in the gRPC layer, even when passing fresh contexts with no timeout/deadline into publish. Google Pub/Sub's accept latency is ~3ms, so we suspect we're sending too many requests and not batching enough. This is corroborated by Pub/Sub's dashboards: we see thousands of RPS, and batches are smaller than we'd expect for our throughput (again, many thousands of messages per second, yet batch sizes are sometimes single digits).

I was hoping to use minimum batch sizes of, say, 10 or 20 to increase batch sizes and throughput at the cost of some latency while still being "safe". Ideally, timeouts would let us do this while capping the maximum latency we introduce, ensuring decent QoS.

Definitely in agreement that shutdown should flush batches; I'll follow up with that in a separate PR. All of that considered, I totally understand the added complexity. What are your thoughts on the QoS/latency guarantees and the overall context added here?
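For concreteness, here is a rough sketch of the kind of configuration described above, assuming the exported `MaxHandlers`, `MinBatchSize`, and `MaxBatchSize` fields of `gocloud.dev/pubsub/batcher.Options`; the commented-out `BatchTimeout` field is hypothetical and stands in for the option this PR proposes.

```go
package example

import "gocloud.dev/pubsub/batcher"

// Illustrative only: trade a little publish latency for larger batches.
// MinBatchSize pushes the batcher to accumulate more messages per RPC,
// while a timeout would cap how long a partially filled batch can wait.
var opts = batcher.Options{
	MaxHandlers:  10,   // concurrent batch sends
	MinBatchSize: 20,   // wait for at least 20 messages per batch...
	MaxBatchSize: 1000, // ...and never send more than this
	// BatchTimeout: 50 * time.Millisecond, // hypothetical: never hold a batch longer than this
}
```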
…n't met

This PR ensures that the batcher flushes on shutdown, even if fewer messages are pending than the specified min batch size. Sending events is preferred to dropping them, even if limits are not obeyed.
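As a minimal illustration (not the actual patch), the rule being described amounts to a decision like the following; the function and parameter names are made up for the sketch.

```go
package example

// shouldSend sketches the flush-on-shutdown rule: during normal operation a
// batch waits until minBatchSize is reached, but once shutdown has begun any
// pending messages are sent rather than dropped.
func shouldSend(pending, minBatchSize int, shuttingDown bool) bool {
	if pending == 0 {
		return false
	}
	if shuttingDown {
		return true // prefer sending a short batch to losing messages
	}
	return pending >= minBatchSize
}
```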
@vangent here's a flush-on-shutdown PR with tests: #3386.

Edit: I've rebased this PR on top of the flush-on-shutdown PR to clean up the code. In doing so, I've also simplified some of the logic and branching for ease of understanding; no one wants huge if statements and complex branching. This PR now contains tests as well.

My personal preference (FWIW, as a non-maintainer of course) is that both land: minimum latency guarantees, plus flushes for safety. Of course, that's easy for me to say :)
This PR adds a timeout option to the batcher. The timeout specifies how long the batcher waits before sending batched messages, even if the number of pending messages is under `MinBatchSize`. This guarantees delivery within a bounded time when throughput is low relative to a higher min batch size (a rough sketch of the semantics follows the to-do list below).

To do:

- [x] Implement batch timeouts in options
- [x] Only allow one concurrent request to check for timeouts
- [x] Handle timeouts when grabbing the next batch
- [ ] Tests
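The sketch below illustrates the intended semantics (it is not the code in this PR): a batch below `MinBatchSize` is held back until either enough messages arrive or a timeout measured from the oldest pending message elapses. All names are hypothetical.

```go
package example

import "time"

// readyToSend sketches the proposed timeout semantics: a batch smaller than
// minBatchSize is normally held back, but once the oldest pending message
// has waited longer than timeout, the batch is sent anyway so low-throughput
// topics still get a delivery guarantee within a bounded time.
func readyToSend(pending int, oldest time.Time, minBatchSize int, timeout time.Duration) bool {
	if pending == 0 {
		return false
	}
	if pending >= minBatchSize {
		return true
	}
	return timeout > 0 && time.Since(oldest) >= timeout
}
```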
I think the flushing PR looks close to good. I'm holding off on reviewing this one until the open questions there are addressed and it is submitted.
Let's get the "flush" PR merged first.
This PR adds a timeout option to the batcher. The timeout specifies how long the batcher waits before sending batched messages, even if the number of pending messages is under `MinBatchSize`.
This ensures delivery within a given time period when there are low-throughput messages with a min batch size that may not be met often.
We'd like to set the minimum batch size, but have concerns that messages may not be sent if the minimum isn't met. Another consideration may be to flush batches on shutdown.
This PR is the barebones concept for batch timeouts, with a minimal, untested plan. If it looks good to the gocloud team, we'll continue working on it and add tests for upstream.
What are your thoughts?
EDIT: This is now a PR on top of #3386.
To do:

- [x] Implement batch timeouts in options
- [x] Only allow one concurrent request to check for timeouts
- [x] Handle timeouts when grabbing the next batch
- [ ] Tests