pubsub: add batch timeouts #3383

Open

tonyhb wants to merge 4 commits into master from chore/add-batch-timeouts

Conversation

@tonyhb commented Feb 23, 2024

This PR adds a timeout option to the batcher. The timeout specifies how long the batcher waits before sending batched messages, even if the number of pending messages is below MinBatchSize.

This provides a delivery guarantee within a bounded time period for low-throughput topics whose minimum batch size may not be met frequently.

We'd like to set a minimum batch size, but are concerned that messages may never be sent if the minimum isn't met. Another consideration is to flush batches on shutdown.

This PR is a bare-bones concept for batch timeouts, currently minimal and untested. If this looks good to the gocloud team, we'll continue working on it and add tests for upstream.

What are your thoughts?

EDIT: This is now a PR on top of #3386.

To do:

  • Implement batch timeouts in options
  • Only allow one concurrent request to check for timeouts
  • Handle timeouts when grabbing next batch
  • Tests
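
For concreteness, here's a rough, self-contained sketch of the intended behavior. This is not the code in this PR; `miniBatcher`, `flushTimeout`, and the other names are illustrative only. A background loop flushes whatever is pending once the timeout elapses, even when the batch is still below the minimum size:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type miniBatcher struct {
	mu           sync.Mutex
	pending      []string
	minBatchSize int
	flushTimeout time.Duration // proposed knob: max time a partial batch may wait
	lastSend     time.Time
}

func (b *miniBatcher) Add(msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, msg)
	// Send as soon as the minimum batch size is reached.
	if len(b.pending) >= b.minBatchSize {
		b.sendLocked()
	}
}

// flushLoop periodically sends whatever is pending once flushTimeout has
// elapsed since the last send, even if the batch is below minBatchSize.
func (b *miniBatcher) flushLoop(done <-chan struct{}) {
	ticker := time.NewTicker(b.flushTimeout)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			b.mu.Lock()
			if len(b.pending) > 0 && time.Since(b.lastSend) >= b.flushTimeout {
				b.sendLocked()
			}
			b.mu.Unlock()
		}
	}
}

// sendLocked must be called with b.mu held.
func (b *miniBatcher) sendLocked() {
	fmt.Printf("sending batch of %d message(s)\n", len(b.pending))
	b.pending = nil
	b.lastSend = time.Now()
}

func main() {
	b := &miniBatcher{
		minBatchSize: 10,
		flushTimeout: 500 * time.Millisecond,
		lastSend:     time.Now(),
	}
	done := make(chan struct{})
	go b.flushLoop(done)

	b.Add("low-throughput message") // below minBatchSize; the timeout sends it anyway
	time.Sleep(time.Second)
	close(done)
}
```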

@vangent (Contributor) commented Feb 25, 2024

What problem are you seeing today that you're trying to solve? I.e., why do you want to set MinBatchSize to something greater than 1 if your throughput is low?

@tonyhb (Author) commented Feb 27, 2024

@vangent our throughput is quite high: thousands to tens of thousands of messages per second, and tens to hundreds of messages per batch, per container.

The problem lies with graceful shutdown and the distribution of work. On SIGINT, machines will stop processing new work. Outstanding work will continue. Most work finishes in milliseconds, but outliers take minutes to hours. As these outliers finish, the throughput drops dramatically, sometimes even to a single job. In this case, we still need pubsub to continue to send.

Without a timeout, the concern is that the min batch size may never be met during shutdown, leaving messages unflushed. A timeout ensures those messages are eventually sent, and we can extend the shutdown grace period by at least the timeout. The timeout also acts as a guarantee of minimum QoS when unpredictable issues arise.

@vangent (Contributor) commented Feb 27, 2024

Interesting, thanks for the description. What do you hope to improve by increasing the MinBatchSize above 1?

Context: This proposal would add a fair amount of complexity to an already-quite-complicated system, so I'd prefer to avoid it if possible. Every time I touch this code I have to page in a bunch of context, run tests and benchmarks, and there are often surprising side effects for various edge cases.

Flushing on Shutdown seems like it might be a simpler approach...?

@tonyhb (Author) commented Feb 27, 2024

I thought about flushing on shutdown and had a note in the comments/PR (or so I thought)! Totally understand wanting to keep things simple, too. That's my entire approach where possible :)

More context: with Google Pub/Sub we sometimes see context timeouts directly in the gRPC layer, even when passing fresh contexts with no timeout/deadline into publish. Google Pub/Sub's accept latency is ~3ms, so we suspect we're sending too many requests and not batching enough. Pub/Sub's dashboards corroborate this: we see thousands of RPS, and batches are smaller than we'd expect for our throughput (again, many thousands of messages per second, yet batch sizes are sometimes in the single digits).

I was hoping to use minimum batch sizes of, say, 10 or 20, to increase batch sizes and throughput at the cost of some latency while still being "safe". Ideally, timeouts would let us do this while also capping the max latency we can introduce, ensuring decent QoS.
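
As a usage sketch, this is roughly the configuration we have in mind. The struct below is illustrative, not the real gocloud.dev batcher.Options, and BatchTimeout in particular is a placeholder name for the proposed option:

```go
package main

import (
	"fmt"
	"time"
)

// batchConfig mirrors the knobs discussed above; it is illustrative only.
type batchConfig struct {
	MinBatchSize int           // prefer larger batches to cut RPC counts
	MaxBatchSize int           // existing upper bound
	BatchTimeout time.Duration // proposed: cap on the latency a partial batch can add
}

func main() {
	// Aim for batches of 10-20 messages against GCP Pub/Sub, but never delay
	// a partial batch by more than 250ms, even when throughput drops off
	// (e.g. during graceful shutdown).
	cfg := batchConfig{
		MinBatchSize: 10,
		MaxBatchSize: 1000,
		BatchTimeout: 250 * time.Millisecond,
	}
	fmt.Printf("%+v\n", cfg)
}
```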

Definitely in agreement that shutdown should flush batches. I'll follow up in a separate PR with that.

All of that considered, I totally understand the increased complexity. What are your thoughts on the QoS/latency guarantees and the overall context added here?

…n't met

This PR ensures that the batcher flushes on shutdown, even if the
pending length is less than the min batch size specified.  Sending
events is preferred to dropping, even if limits are not obeyed.
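
As an illustrative sketch of the behavior described in this commit (not the actual batcher change; shutdownBatcher and its methods are made-up names), Shutdown drains whatever is pending regardless of the minimum batch size:

```go
package main

import (
	"fmt"
	"sync"
)

// shutdownBatcher is a toy stand-in for the real batcher, used only to
// illustrate flushing on shutdown.
type shutdownBatcher struct {
	mu           sync.Mutex
	pending      []string
	minBatchSize int
	shutdown     bool
}

func (b *shutdownBatcher) Add(msg string) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.shutdown {
		return fmt.Errorf("batcher is shut down")
	}
	b.pending = append(b.pending, msg)
	if len(b.pending) >= b.minBatchSize {
		b.sendLocked()
	}
	return nil
}

// Shutdown flushes any pending messages even if the batch is smaller than
// minBatchSize: sending a short batch is preferred to dropping messages.
func (b *shutdownBatcher) Shutdown() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.shutdown = true
	if len(b.pending) > 0 {
		b.sendLocked()
	}
}

// sendLocked must be called with b.mu held.
func (b *shutdownBatcher) sendLocked() {
	fmt.Printf("sending batch of %d message(s)\n", len(b.pending))
	b.pending = nil
}

func main() {
	b := &shutdownBatcher{minBatchSize: 10}
	_ = b.Add("last straggler") // below minBatchSize, so it stays pending
	b.Shutdown()                // flushed here instead of being dropped
}
```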

@tonyhb (Author) commented Feb 27, 2024

@vangent here's a flush on shutdown PR with tests: #3386.

Edit:

I've rebased this PR on top of the flush-on-shutdown PR to clean up the code. In doing so, I've also simplified some of the logic and branching for ease of understanding; no one wants huge if statements and complex branching. This PR now also contains tests.

My personal preference (FWIW, as a non-maintainer, of course) is that both land: latency guarantees via timeouts, and flushes for safety. Of course, that's easy for me to say :)

This PR adds a timeout option to the batcher.  The timeout specifies how
long the batcher waits before sending batched messages, even if the
number of pending messages is below MinBatchSize.

This provides a delivery guarantee within a bounded time period for
low-throughput topics that use a higher min batch size.

To do:

- [x] Implement batch timeouts in options
- [x] Only allow one concurrent request to check for timeouts
- [x] Handle timeouts when grabbing next batch
- [ ] Tests
@tonyhb force-pushed the chore/add-batch-timeouts branch from bd8cc4e to e6a3015 on February 27, 2024 at 21:56
@tonyhb force-pushed the chore/add-batch-timeouts branch from 338136c to bbe768a on February 27, 2024 at 22:10
@vangent (Contributor) commented Jun 3, 2024

I think the flushing PR looks close to ready; I'm holding off on reviewing this one until the open questions there are addressed and it is submitted.

@vangent (Contributor) left a comment


Let's get the "flush" PR merged first.
