Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add StepDownOnRemoval #79

Merged
merged 1 commit into from
Jul 17, 2023
Merged

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Jun 27, 2023

In 72d62a1, we made Raft leaders step down when they removed themselves from the group. However, this broke backwards compatibility. This patch makes the behavior opt-in via a new StepDownOnRemoval config option, with the intent to enable it unconditionally in the next major release.

In CockroachDB, we campaign a designated voter when the leader is removed from the group. However, during a joint config change, the leader is first demoted to learner before removal. The step-down logic introduced in 72d62a1 triggers on learner demotion (which is the right behavior), but this left the CockroachDB group without a leader, having to wait out an election timeout. While it is straightforward to change this behavior in CockroachDB, we have to maintain backwards compatibility with older nodes during rolling upgrades.

Signed-off-by: Erik Grinaker [email protected]

cc @ahrtr @tbg @pavelkalinnikov

My bad for not testing this change properly in the first place. I'll write up an issue to enable by default in the next major release.

@serathius
Copy link
Member

serathius commented Jun 27, 2023

This make me worried about lack of proper backward compatibility testing. We should not depend on CockroachDB to discover such issues. What would happened if CockroachDB missies a issue that impacts etcd. Etcd has much slower development cycle, results would be really bad.

Could someone on raft side take an action item to propose a test plan for backward compatibility?

@erikgrinaker
Copy link
Contributor Author

This change is arguably in the gray zone where a bug fix breaks backwards compatibility -- the fact that CockroachDB relies on a learner to commit and replicate the final conf change is problematic. Even so, it's a bug that we currently rely on.

The original PR included test changes for this change in behavior, so in some sense it was already covered by tests (and this PR increases that coverage). The change in behavior was considered beneficial, but in retrospect we should have recognized that any such change in behavior has potential to break something for someone, and should have gated it on a config option in the first place.

@tbg tbg requested a review from pav-kv June 30, 2023 12:21
@ahrtr
Copy link
Member

ahrtr commented Jul 10, 2023

in retrospect we should have recognized that any such change in behavior has potential to break something for someone, and should have gated it on a config option in the first place

Makes sense.

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

thx

In 72d62a1, we made Raft leaders step down when they removed themselves
from the group. However, this broke backwards compatibility. This patch
makes the behavior opt-in via a new `StepDownOnRemoval` config option,
with the intent to enable it unconditionally in the next major release.

In CockroachDB, we campaign a designated voter when the leader is
removed from the group. However, during a joint config change, the
leader is first demoted to learner before removal. The step-down logic
introduced in 72d62a1 triggers on learner demotion (which is the right
behavior), but this left the CockroachDB group without a leader, having
to wait out an election timeout. While it is straightforward to change
this behavior in CockroachDB, we have to maintain backwards
compatibility with older nodes during rolling upgrades.

Signed-off-by: Erik Grinaker <[email protected]>
@erikgrinaker erikgrinaker force-pushed the step-down-on-removal branch from eaec4b0 to ee0fe9d Compare July 17, 2023 15:24
@tbg tbg merged commit 72a6e6c into etcd-io:main Jul 17, 2023
tbg added a commit to tbg/cockroach that referenced this pull request Jul 17, 2023
Picks up

- etcd-io/raft#77
- etcd-io/raft#79 (essentially an off-by-default for 77)
- etcd-io/raft#82
- etcd-io/raft#81 (needed for cockroachdb#105797)

Epic: none
Release note: None
@ahrtr ahrtr added this to the v3.6.0 milestone Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants