[Tablets] Skip any decommission/remove/terminate node nemesis from running when the number of nodes is not larger than the number of replicas #7218

Open
ShlomiBalalis opened this issue Feb 19, 2024 · 5 comments

Comments

@ShlomiBalalis
Contributor

Issue description

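In these nemeses (decommission, remove node, terminate node), the target node has to be drained of its tablets. When the total number of nodes is not larger than the number of replicas, there is no other node left to take over the tablet replicas, so the drain fails: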

2024-02-05 11:32:23.970 <2024-02-05 11:32:23.869>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=147b21f0-4531-4dd4-9163-8e006a48fd33: type=RUNTIME_ERROR regex=std::runtime_error line_number=3299 node=longevity-5gb-1h-TerminateAndRemove-db-node-659547b4-1
2024-02-05T11:32:23.869+00:00 longevity-5gb-1h-TerminateAndRemove-db-node-659547b4-1      !ERR | scylla[6009]:  [shard 0:strm] raft_topology - tablets draining failed with std::runtime_error (Unable to find new replica for tablet 645da7f0-c419-11ee-8ee9-26c98a3da6ca:1 on daf03d5e-691b-448c-9a59-4dc058c920fd:1 when draining {daf03d5e-691b-448c-9a59-4dc058c920fd}). Aborting the topology operation

2024-02-05 11:32:23.976 <2024-02-05 11:32:23.869>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=147b21f0-4531-4dd4-9163-8e006a48fd33: type=RUNTIME_ERROR regex=std::runtime_error line_number=3307 node=longevity-5gb-1h-TerminateAndRemove-db-node-659547b4-1
2024-02-05T11:32:23.869+00:00 longevity-5gb-1h-TerminateAndRemove-db-node-659547b4-1      !ERR | scylla[6009]:  [shard 0:strm] raft_topology - Removenode failed. See earlier errors (Rolled back: Failed to drain tablets: std::runtime_error (Unable to find new replica for tablet 645da7f0-c419-11ee-8ee9-26c98a3da6ca:1 on daf03d5e-691b-448c-9a59-4dc058c920fd:1 when draining {daf03d5e-691b-448c-9a59-4dc058c920fd}))

We should add a check at the start of those nemeses and skip them when the number of nodes is not larger than the number of replicas.
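
A minimal sketch of what such a check could look like, assuming the nemesis can obtain the current node count and the replication factor of each tablets-enabled keyspace. The names `SkipNemesis`, `ensure_node_can_be_drained`, and `keyspace_rfs` are hypothetical and only illustrate the condition; they are not the framework's actual API.

```python
from typing import Mapping


class SkipNemesis(Exception):
    """Hypothetical stand-in for whatever skip signal the test framework uses."""


def ensure_node_can_be_drained(node_count: int, keyspace_rfs: Mapping[str, int]) -> None:
    """Raise SkipNemesis if removing one node would leave tablets without a new replica.

    node_count: number of nodes currently in the cluster (assumed input).
    keyspace_rfs: replication factor per tablets-enabled keyspace (assumed input).
    """
    # The keyspace with the highest RF is the limiting one.
    max_rf = max(keyspace_rfs.values(), default=0)
    # Draining needs at least one node beyond the existing replica set,
    # so the nemesis is only safe when node_count is strictly larger than RF.
    if node_count <= max_rf:
        raise SkipNemesis(
            f"{node_count} nodes is not larger than the replication factor "
            f"{max_rf}; tablets cannot be drained from a node"
        )
```

Calling this at the start of each decommission/remove/terminate nemesis would skip the operation up front instead of letting the topology operation be rolled back with the errors above.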

@ShlomiBalalis ShlomiBalalis self-assigned this Feb 19, 2024
@roydahan
Contributor

@ShlomiBalalis This may be an enhancement, but it shouldn't block you in any case.
If it does "block" you, the first thing you should do is change the test case, or add a new one, so that it has the configuration you want.

@roydahan
Contributor

@ShlomiBalalis I don't know what procedure product suggests here; they should supply you with the procedure for each use case
(e.g. replacing a live node, replacing a dead node, shrinking the cluster).

Once ALTER is supported, this case should use ALTER, which will also cover the case of ALTER KEYSPACE with an RF change.

@bhalevy
Member

bhalevy commented Jul 23, 2024

@ShlomiBalalis please close this if everything has been dealt with.

@yarongilor
Contributor

Should be addressed in #7625

@swasik

swasik commented Nov 6, 2024

> Should be addressed in #7625

@yarongilor #7625 is merged now, so could we close this one?
