fix(NodeBootstrapAbortManager): use correct timeouts for operations #7004
Conversation
If the whole nemesis is built on the timing of log messages, we should find or ask for APIs to replace it; logging at a bigger scale isn't going to be very accurate.
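For context, the pattern being discussed is roughly: watch the node's log for a marker line and trigger the abort once it appears, bounded by a timeout. Below is a minimal sketch of that pattern; the helper function, log path, and pattern string are illustrative only, not the actual SCT API or Scylla log text.

```python
import re
import time


def wait_for_log_marker(log_path: str, pattern: str, timeout: float) -> bool:
    """Poll a log file until a line matching `pattern` shows up or `timeout` expires."""
    deadline = time.monotonic() + timeout
    regex = re.compile(pattern)
    offset = 0
    while time.monotonic() < deadline:
        with open(log_path, errors="ignore") as log_file:
            log_file.seek(offset)
            for line in log_file:
                if regex.search(line):
                    return True
            offset = log_file.tell()
        time.sleep(1)
    return False


# The abort fires only as fast as the log is flushed and polled, which is the
# concern above: under heavy load the marker can show up late or not at all.
if wait_for_log_marker("/var/log/scylla/scylla.log", r"bootstrap.*start", timeout=300):
    print("trigger abort")
```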
@@ -0,0 +1,37 @@
test_duration: 90
Please use the hydra command to generate the nemesis pipeline; don't copy it by hand.
@fruch your comment is for the pipeline, right?
Not for the yaml.
Removed this commit
I agree with @fruch here.
4026d87 to 15971e5 Compare
I agree that the log message might not be a good choice, but we don't have another marker for which process step is running right now. When raft topology becomes a GA feature, we can try to switch to the raft history or raft state table to get the process state.
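A rough sketch of what that alternative could look like: poll a raft/group0 system table over CQL instead of scraping the log. The table name `system.group0_history` and its columns are an assumption here (availability depends on the Scylla version and on raft being enabled), so treat this as an illustration only, not the nemesis' actual implementation.

```python
# Illustration only: read raft group0 history via CQL instead of log scraping.
# Assumes system.group0_history exists and is readable on the target version.
from cassandra.cluster import Cluster  # pip install cassandra-driver


def recent_group0_events(contact_point: str, limit: int = 10):
    cluster = Cluster([contact_point])
    try:
        session = cluster.connect()
        rows = session.execute(
            "SELECT state_id, description FROM system.group0_history LIMIT %d" % limit
        )
        return [(row.state_id, row.description) for row in rows]
    finally:
        cluster.shutdown()


for state_id, description in recent_group0_events("127.0.0.1"):
    print(state_id, description)
```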
@aleksbykov it's quite confusing that your pipeline runs end with failure, not that I can understand why.
LGTM
NodeBootstrapAbortManager runs two operations in parallel with ParallelObject:
1) bootstrap (node_setup)
2) abort_operation (stop_scylla)

ParallelObject has a common timeout for the two operations equal to
`Instance_start + log message wait timeout`.
The bootstrap operation has a default timeout of 3600 seconds (but the operation has to be aborted earlier).
abort_operation (stop_scylla) has a log message wait timeout equal to
`Instance_start + log message wait timeout`.

Usually the abort operation is triggered much earlier: the required log message appears in the log and
the abort action is executed.
But during the longevity-100gb-4h-fips-test job, the required log message was not found within the required timeout. ParallelObject exited and the next steps of the nemesis started,
but because the threads initiated by ParallelObject were still in the running state,
ParallelObject could not stop them and they kept running in the background.
This caused a race between the threads: all Scylla data was
removed while Scylla was starting, which triggered a coredump.

The fix changes the expected timeouts to avoid races between threads
and to make sure that all operations finish as expected.
Fix #6568
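To illustrate the timeout relationship the fix enforces, here is a stdlib-based sketch; ParallelObject itself is SCT's helper, and the numbers and function bodies below are illustrative placeholders rather than the values from the patch. The point is that the outer wait must be at least as long as the longest inner operation, otherwise the worker threads outlive it and race with the next nemesis step.

```python
from concurrent.futures import ThreadPoolExecutor, wait

INSTANCE_START_TIMEOUT = 600        # illustrative value
LOG_MESSAGE_WAIT_TIMEOUT = 900      # illustrative value

# Both inner operations and the outer wait share one budget, so nothing can
# keep running in the background after the parallel step returns.
OPERATION_TIMEOUT = INSTANCE_START_TIMEOUT + LOG_MESSAGE_WAIT_TIMEOUT


def bootstrap_node(timeout: float) -> str:
    # stand-in for node_setup(); must finish (or be aborted) within `timeout`
    return "bootstrap finished or aborted"


def abort_bootstrap(timeout: float) -> str:
    # stand-in for stop_scylla() triggered once the watched log message appears
    return "abort executed"


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(bootstrap_node, OPERATION_TIMEOUT),
        pool.submit(abort_bootstrap, OPERATION_TIMEOUT),
    ]
    # The outer timeout matches the inner budgets, so no thread outlives this wait.
    done, not_done = wait(futures, timeout=OPERATION_TIMEOUT)
```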