
fix(NodeBootstrapAbortManager): use correct timeouts for operations #7004

Merged
merged 1 commit into scylladb:master on Dec 26, 2023

Conversation

@aleksbykov (Contributor) commented Dec 25, 2023

NodeBootstrapAbortManager runs two operations in parallel with ParallelObject:

  1. bootstrap (node_setup)
  2. abort_operation (stop_scylla)

ParallelObject uses a common timeout for both operations, equal to
Instance_start + log message wait timeout.

The bootstrap operation has a default timeout of 3600 seconds (but the operation is expected to be aborted earlier).

abort_operation (stop_scylla) waits for the log message with a timeout also equal to
Instance_start + log message wait timeout.

Usually the abort operation is triggered much earlier: once the required log message appears in the log,
the abort action is executed.

But during the longevity-100gb-4h-fips-test job, the required log message was not found within the
required timeout. ParallelObject exited and the next steps of the nemesis started,
but because the threads started by ParallelObject were still running,
ParallelObject could not stop them and they kept running in the background.
This caused a race between the threads: all Scylla data was removed while Scylla
was starting, which triggered a coredump.
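
For illustration, here is a minimal Python sketch (not SCT's actual ParallelObject code; the helper names and timeout values are made up) of how a wrapper-level timeout can return control to the caller while its worker threads are still alive:

```python
import concurrent.futures
import time

COMMON_TIMEOUT = 5  # stands in for "Instance_start + log message wait timeout"

def bootstrap():
    time.sleep(30)  # stands in for node_setup with its own 3600 s timeout
    return "bootstrap finished"

def abort_operation():
    time.sleep(30)  # stands in for waiting on a log message that never appears
    return "abort finished"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
futures = [pool.submit(bootstrap), pool.submit(abort_operation)]
done, not_done = concurrent.futures.wait(futures, timeout=COMMON_TIMEOUT)
print(f"finished: {len(done)}, still running: {len(not_done)}")  # 0 and 2

# wait() has returned, but the two threads are still executing:
# shutdown(wait=False) does not stop tasks that have already started, so
# whatever the caller does next (e.g. cleaning up Scylla data) races with them.
pool.shutdown(wait=False)
```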

The fix adjusts the expected timeouts to avoid races between the threads
and to make sure that all operations finish as expected.
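
Roughly, the intent is that every inner timeout is bounded by the outer ParallelObject timeout, so the wrapper only returns once both threads are really done. A sketch of that relationship (the values and variable names below are illustrative, not the actual SCT configuration):

```python
# Illustrative values only; the real timeouts come from SCT configuration.
INSTANCE_START_TIMEOUT = 600       # seconds
LOG_MESSAGE_WAIT_TIMEOUT = 600     # seconds

# Both inner operations are bounded by the same deadline...
abort_log_wait_timeout = INSTANCE_START_TIMEOUT + LOG_MESSAGE_WAIT_TIMEOUT
bootstrap_timeout = abort_log_wait_timeout        # instead of a flat 3600 s

# ...and the outer ParallelObject timeout is strictly larger, so neither
# thread can still be running when the wrapper returns.
parallel_object_timeout = bootstrap_timeout + 60

assert bootstrap_timeout < parallel_object_timeout
assert abort_log_wait_timeout < parallel_object_timeout
```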

Fix #6568

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@fruch (Contributor) left a comment

If the whole nemesis is built on the timing of log messages, we should find or ask for APIs to replace it; logging at a bigger scale isn't going to be very accurate.

@@ -0,0 +1,37 @@
test_duration: 90
Contributor

Please use the hydra command to generate the nemesis pipeline, don't copy it by hand

Contributor

@fruch your comment is for the pipeline, right?
Not for the yaml.

Contributor Author

Removed this commit

@roydahan (Contributor)

If the whole nemesis is built on the timing of log messages, we should find or ask for APIs to replace it; logging at a bigger scale isn't going to be very accurate.

I agree with @fruch here.
This PR should get in anyway to fix the current issue.
However, @aleksbykov and @temichus, please open a task to harden this nemesis so it is more deterministic and relies less on log messages.
We already know that SCT can process log messages with a big delay when there are too many messages.

@aleksbykov force-pushed the fix-timeout-race-between-ops branch from 4026d87 to 15971e5 on December 26, 2023 06:49
@aleksbykov aleksbykov requested a review from fruch December 26, 2023 06:50
@aleksbykov (Contributor Author)

If the whole nemesis is built on the timing of log messages, we should find or ask for APIs to replace it; logging at a bigger scale isn't going to be very accurate.

I agree with @fruch here. This PR should get in anyway to fix the current issue. However, @aleksbykov and @temichus, please open a task to harden this nemesis so it is more deterministic and relies less on log messages. We already know that SCT can process log messages with a big delay when there are too many messages.

I agree that a log message may not be a good choice, but we don't have another marker for which step of the process is running right now. Once raft topology becomes a GA feature, we can try switching to the raft history or raft state table to get the process state.
The task has been created: https://github.com/scylladb/qa-tasks/issues/1590

@fruch (Contributor) commented Dec 26, 2023

@aleksbykov it's quite confusing that your pipeline runs end with failure; not that I can understand why.

@fruch (Contributor) left a comment

LGTM

@fruch merged commit 142700a into scylladb:master on Dec 26, 2023
5 checks passed
@fruch added the backport/2024.1-done (Commit backported to 2024.1) and backport/5.4-done (Commit backported to 5.4) labels on Dec 26, 2023
Labels
backport/5.4-done (Commit backported to 5.4), backport/5.4 (Need backport to 5.4), backport/2024.1-done (Commit backported to 2024.1), backport/2024.1 (Need backport to 2024.1), Ready for review
Projects
None yet
3 participants