
fix(NodeBootstrapAbortManager): use correct timeouts for operations #7004

Merged
merged 1 commit into scylladb:master on Dec 26, 2023

Conversation

@aleksbykov (Contributor) commented Dec 25, 2023

NodeBootstrapAbortManager runs two operations in parallel with ParallelObject:

  1. bootstrap (node_setup)
  2. abort_operation (stop_scylla)

ParallelObject uses a common timeout for both operations, equal to
Instance_start + log message wait timeout.

The bootstrap operation has a default timeout of 3600 seconds (but the operation is expected to be aborted earlier).

abort_operation (stop_scylla) waits for the log message with a timeout also equal to
Instance_start + log message wait timeout.

Usually the abort operation is triggered much earlier: once the required log message appears in the log,
the abort action is executed.

But during the longevity-100gb-4h-fips-test job, the required log message was not found within the
required timeout. ParallelObject exited and the next steps of the nemesis started,
but because the threads started by ParallelObject were still running,
ParallelObject could not stop them and they kept running in the background.
This caused a race between the threads: all Scylla data was removed while Scylla
was starting, which triggered a coredump.
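
For illustration, here is a minimal Python sketch (not SCT's actual ParallelObject code; the helper names and timeout values are made up) of how a wrapper-level timeout can return control to the caller while its worker threads are still alive:

```python
import concurrent.futures
import time

COMMON_TIMEOUT = 5  # stands in for "Instance_start + log message wait timeout"

def bootstrap():
    time.sleep(30)  # stands in for node_setup with its own 3600 s timeout
    return "bootstrap finished"

def abort_operation():
    time.sleep(30)  # stands in for waiting on a log message that never appears
    return "abort finished"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
futures = [pool.submit(bootstrap), pool.submit(abort_operation)]
done, not_done = concurrent.futures.wait(futures, timeout=COMMON_TIMEOUT)
print(f"finished: {len(done)}, still running: {len(not_done)}")  # 0 and 2

# wait() has returned, but the two threads are still executing:
# shutdown(wait=False) does not stop tasks that have already started, so
# whatever the caller does next (e.g. cleaning up Scylla data) races with them.
pool.shutdown(wait=False)
```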

The fix adjusts the expected timeouts to avoid races between the threads
and to make sure that all operations finish as expected.
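
Roughly, the intent is that every inner timeout is bounded by the outer ParallelObject timeout, so the wrapper only returns once both threads are really done. A sketch of that relationship (the values and variable names below are illustrative, not the actual SCT configuration):

```python
# Illustrative values only; the real timeouts come from SCT configuration.
INSTANCE_START_TIMEOUT = 600       # seconds
LOG_MESSAGE_WAIT_TIMEOUT = 600     # seconds

# Both inner operations are bounded by the same deadline...
abort_log_wait_timeout = INSTANCE_START_TIMEOUT + LOG_MESSAGE_WAIT_TIMEOUT
bootstrap_timeout = abort_log_wait_timeout        # instead of a flat 3600 s

# ...and the outer ParallelObject timeout is strictly larger, so neither
# thread can still be running when the wrapper returns.
parallel_object_timeout = bootstrap_timeout + 60

assert bootstrap_timeout < parallel_object_timeout
assert abort_log_wait_timeout < parallel_object_timeout
```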

Fix #6568

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@fruch (Contributor) left a comment

If the whole nemesis is built on the timing of log messages, we should find or ask for APIs to replace it; logging at a bigger scale isn't going to be very accurate.

@@ -0,0 +1,37 @@
test_duration: 90
Contributor

Please use the hydra command to generate the nemesis pipeline, don't copy it by hand

Contributor

@fruch your comment is for the pipeline, right?
Not for the yaml.

Contributor Author

Removed this commit

@roydahan (Contributor)

If the whole nemesis is built on the timing of log messages, we should find or ask for APIs to replace it; logging at a bigger scale isn't going to be very accurate.

I agree with @fruch here.
This PR should get in anyway to fix the current issue.
However, @aleksbykov and @temichus, please open a task to harden this nemesis so it is more deterministic and relies less on log messages.
We already know that SCT can process log messages with a big delay when there are too many messages.

@aleksbykov force-pushed the fix-timeout-race-between-ops branch from 4026d87 to 15971e5 on December 26, 2023 06:49
@aleksbykov aleksbykov requested a review from fruch December 26, 2023 06:50
@aleksbykov (Contributor Author)

If the whole nemesis is built on the timing of log messages, we should find or ask for APIs to replace it; logging at a bigger scale isn't going to be very accurate.

I agree with @fruch here. This PR should get in anyway to fix the current issue. However, @aleksbykov and @temichus, please open a task to harden this nemesis so it is more deterministic and relies less on log messages. We already know that SCT can process log messages with a big delay when there are too many messages.

I agree that a log message may not be a good choice, but we don't have another marker for which step of the process is running right now. Once raft topology becomes a GA feature, we can try switching to the raft history or raft state table to get the process state.
The task has been created: https://github.com/scylladb/qa-tasks/issues/1590

@fruch (Contributor) commented Dec 26, 2023

@aleksbykov it's quite confusing that your pipeline runs end with failure; not that I can understand why.

@fruch (Contributor) left a comment

LGTM

@fruch merged commit 142700a into scylladb:master on Dec 26, 2023
5 checks passed
@fruch added the backport/2024.1-done (Commit backported to 2024.1) and backport/5.4-done (Commit backported to 5.4) labels on Dec 26, 2023
Labels
backport/5.4-done (Commit backported to 5.4), backport/5.4 (Need backport to 5.4), backport/2024.1-done (Commit backported to 2024.1), backport/2024.1 (Need backport to 2024.1), Ready for review
Projects
None yet
3 participants