Add a caution note about making apply context bounded #18674

Closed
5 changes: 5 additions & 0 deletions server/etcdserver/apply/uber_applier.go
@@ -115,6 +115,11 @@ func (a *uberApplier) Apply(r *pb.InternalRaftRequest) *Result {
// then dispatch() unpacks the request to a specific method (like Put),
// that gets executed down the hierarchy again:
// i.e. CorruptApplier.Put(CappedApplier.Put(...(BackendApplier.Put(...)))).
//
// CAUTION: The context below should NOT be changed to a bounded value without
// first addressing the risk of a transaction's operations being only partially
// applied when some operations timeout. More details here:
// https://github.com/etcd-io/etcd/issues/18667#issuecomment-2392286839
Comment on lines +118 to +122
Member

Suggested change
- //
- // CAUTION: The context below should NOT be changed to a bounded value without
- // first addressing the risk of a transaction's operations being only partially
- // applied when some operations timeout. More details here:
- // https://github.com/etcd-io/etcd/issues/18667#issuecomment-2392286839
+ //
+ // CAUTION: Do NOT change the context below to have a timeout (i.e., bounded value).
+ // The apply workflow may be intentionally interrupted for expected reasons, but it
+ // should never fail due to non-deterministic factors such as a context timeout. If
+ // the workflow is interrupted by such factors, it can lead to a scenario where some
+ // members apply changes successfully while others fail, potentially causing data
+ // inconsistency issues like https://github.com/etcd-io/etcd/issues/18667.
+ // See also https://github.com/etcd-io/etcd/issues/18667#issuecomment-2392286839.
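
To make the risk described in the suggested comment concrete, here is a minimal, self-contained Go sketch. It is not etcd code: `op`, `applyTxn`, `simulate`, and the `afterOp` hook are hypothetical names, and an explicit `cancel()` stands in for a deadline firing so that the example stays deterministic. It shows two members applying the same two-operation transaction, with one member's context ending partway through.

```go
package main

import (
	"context"
	"fmt"
)

// op is a hypothetical stand-in for one operation inside a transaction.
type op struct{ key, val string }

// applyTxn applies ops in order. Before each op it checks the context and
// stops if the context has ended, returning how many ops were applied.
// afterOp is a hypothetical test hook invoked after each applied op.
func applyTxn(ctx context.Context, store map[string]string, ops []op, afterOp func(i int)) (int, error) {
	for i, o := range ops {
		if err := ctx.Err(); err != nil {
			return i, err // earlier ops stay applied: a partial transaction
		}
		store[o.key] = o.val
		if afterOp != nil {
			afterOp(i)
		}
	}
	return len(ops), nil
}

// simulate applies the same transaction on two members. The second member's
// context is cancelled after its first op, standing in for a timeout.
func simulate() (fast, slow map[string]string, slowErr error) {
	ops := []op{{"a", "1"}, {"b", "2"}}

	fast = map[string]string{}
	applyTxn(context.Background(), fast, ops, nil)

	slow = map[string]string{}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	_, slowErr = applyTxn(ctx, slow, ops, func(i int) {
		if i == 0 {
			cancel()
		}
	})
	return fast, slow, slowErr
}

func main() {
	fast, slow, err := simulate()
	fmt.Printf("fast=%v slow=%v err=%v\n", fast, slow, err)
}
```

In this sketch the member with the unbounded context ends up with both keys while the other keeps only the first, which is the kind of divergence between members that the comment warns about.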

Member

I don't like the idea that adding a comment is meant to protect us from disaster. A context timeout can be assigned at any lower level of the apply loop where this comment is not present, with the same consequences.

I would prefer a slow, methodical removal of the context. The first PR can just remove ctx from the Apply method that we know uses context.TODO().

Member

> I don't like the idea that adding a comment is meant to protect us from disaster. A context timeout can be assigned at any lower level of the apply loop where this comment is not present, with the same consequences.

It's just an easy, safe, and temporary improvement for the existing situation. Overall it's not a big deal to me, so it doesn't deserve much discussion. Either quickly approve & merge this PR or just reject it.

> The first PR can just remove ctx from the Apply method that we know uses context.TODO().

I'd suggest evaluating the effort & impact first. Afterwards, we can break it down into PRs.

Member

@serathius Oct 4, 2024

> I'd suggest evaluating the effort & impact first. Afterwards, we can break it down into PRs.

We already did in #18667 (comment). As I mentioned there, the context is used in just two ways: tracing and authorization metadata. Removing an argument from a function is not that risky. If there is a risk, we should mitigate it with testing, but we should not just leave a comment and say we patched the problem.
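
As a purely illustrative sketch of what removing the context argument could look like, the Go snippet below passes the two pieces of data mentioned above explicitly instead of through a `context.Context`. Everything here is an assumption made for illustration: `applyMeta`, `request`, `applier`, and this `Apply` signature are hypothetical and do not reflect etcd's actual types or any agreed design.

```go
package main

import "fmt"

// applyMeta is a hypothetical struct carrying the two things the context is
// said to be used for: tracing and authorization metadata.
type applyMeta struct {
	TraceID  string
	AuthUser string
}

// request is a hypothetical stand-in for an already-committed raft entry.
type request struct{ Key, Val string }

// applier is a hypothetical stand-in for the apply layer.
type applier struct{ store map[string]string }

// Apply takes no context.Context, so no caller at any level can attach a
// deadline or cancellation to it. The metadata travels as a plain value.
func (a *applier) Apply(meta applyMeta, r request) string {
	a.store[r.Key] = r.Val
	return fmt.Sprintf("trace=%s user=%s put %s", meta.TraceID, meta.AuthUser, r.Key)
}

func main() {
	a := &applier{store: map[string]string{}}
	fmt.Println(a.Apply(applyMeta{TraceID: "t1", AuthUser: "alice"}, request{Key: "k", Val: "v"}))
}
```

With a signature like this, the absence of a timeout is enforced by the type system and does not rely on a code comment, which is the trade-off being argued in this thread.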

Member

> We already did in #18667 (comment). As I mentioned there, the context is used in just two ways: tracing and authorization metadata.

I don't think a simple comment is enough. I expect a doc or summary that clarifies why (of course, that part is already clear), "how" you will resolve it, and the "impact" on etcdserver. It would be perfect if we could have PoC PRs.

return a.applyV3.Apply(context.TODO(), r, a.dispatch)
}
