
delete pod to stop a node instead of scaling sts #15214

Closed
aluon wants to merge 2 commits from the aluon/push-lrkyvlwzlkxm branch

Conversation

@aluon (Contributor) commented Nov 6, 2024

Description

Add a method to temporarily stop a fullnode/validator node by repeatedly deleting its pod. This is used for the fullnode/validator stress tests, where we want to keep the underlying node allocated to the StatefulSet. We suspect that the previous method of scaling the StatefulSet was causing node allocation delays that made these tests time out.
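
The method's body isn't visible in this view; as a rough illustration of the approach described above, here is a minimal sketch using kube-rs directly. The function name, polling interval, and error handling are hypothetical and not the actual Forge API. The point, per the description, is that the StatefulSet's replica count never changes, so the underlying node stays allocated to it.

```rust
use std::time::{Duration, Instant};

use k8s_openapi::api::core::v1::Pod;
use kube::{api::DeleteParams, Api, Client};

/// Keep a node down for `duration` by repeatedly deleting its pod whenever
/// the StatefulSet controller recreates it. The StatefulSet itself is never
/// scaled, so the underlying k8s node stays allocated to it.
async fn stop_node_by_deleting_pod(
    client: Client,
    namespace: &str,
    pod_name: &str,
    duration: Duration,
) -> anyhow::Result<()> {
    let pods: Api<Pod> = Api::namespaced(client, namespace);
    let deadline = Instant::now() + duration;

    while Instant::now() < deadline {
        // Ignore errors here: the pod may already be gone from a previous
        // deletion and not yet recreated by the controller.
        if let Err(e) = pods.delete(pod_name, &DeleteParams::default()).await {
            eprintln!("delete of {pod_name} failed (may already be gone): {e}");
        }
        // Poll again shortly; if the controller has recreated the pod in the
        // meantime, the next iteration deletes it again.
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
    Ok(())
}
```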

How Has This Been Tested?

Ran the adhoc Forge workflow.

Key Areas to Review

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

trunk-io bot commented Nov 6, 2024

⏱️ 1h 42m total CI duration on this PR
Slowest 15 Jobs              Cumulative Duration   Recent Runs
test-target-determinator     29m                   🟩🟩🟩🟩 (+2 more)
adhoc-forge-test / forge     13m                   🟩
adhoc-forge-test / forge     13m                   🟥
rust-cargo-deny              11m                   🟩🟩🟩🟩🟥 (+2 more)
check-dynamic-deps           10m                   🟩🟩🟩🟩🟩 (+3 more)
semgrep/ci                   4m                    🟩🟩🟩🟩🟩 (+3 more)
rust-move-tests              3m                    🟩
general-lints                3m                    🟩🟩🟩🟩🟩 (+2 more)
rust-move-tests              2m                    🟩
rust-move-tests              2m                    🟩
rust-move-tests              2m                    🟩
rust-move-tests              2m                    🟩
rust-move-tests              2m                    🟩
rust-move-tests              2m                    🟩
file_change_determinator     1m                    🟩🟩🟩🟩🟩 (+2 more)


@aluon added the CICD:build-images label Nov 6, 2024
@aluon force-pushed the aluon/push-lrkyvlwzlkxm branch 3 times, most recently from 89d64d3 to 3f50267 on November 6, 2024 at 20:25
@aluon requested review from ibalajiarun, perryjrandall, vusirikala and a team on November 6, 2024 at 20:41
@aluon marked this pull request as ready for review on November 6, 2024 at 20:41

A review thread on this hunk from the diff:

// Keep deleting the pod if it recovers before the deadline
while Instant::now() < deadline {
    match self.wait_until_healthy(deadline).await {
Reviewer (Contributor):

Waiting until healthy will make the node catch up to the latest state, which is something we don't want to do. Can we wait until the pod is running instead?

aluon (Contributor Author):

Doesn't that match the existing behavior? These tests were using Node::start(), which calls wait_until_healthy().

I updated this to check the pod status instead.

Reviewer (Contributor):

No, before we had stop() -> sleep -> start() -> healthy. Now it's kill() -> healthy() -> kill() -> healthy() -> ..., right?
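
The resolution was to wait for the pod itself rather than for node health. A minimal sketch of such a check with kube-rs, again with hypothetical names and not the actual change in this PR:

```rust
use std::time::{Duration, Instant};

use k8s_openapi::api::core::v1::Pod;
use kube::{Api, Client};

/// Wait only until the recreated pod reports phase "Running", without
/// waiting for the node itself to become healthy (i.e. without letting it
/// catch up to the latest state).
async fn wait_until_pod_running(
    client: Client,
    namespace: &str,
    pod_name: &str,
    deadline: Instant,
) -> anyhow::Result<()> {
    let pods: Api<Pod> = Api::namespaced(client, namespace);
    while Instant::now() < deadline {
        if let Ok(pod) = pods.get(pod_name).await {
            // Pod phase is a plain string in the k8s API ("Pending",
            // "Running", "Succeeded", "Failed", "Unknown").
            let phase = pod.status.and_then(|s| s.phase).unwrap_or_default();
            if phase == "Running" {
                return Ok(());
            }
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
    anyhow::bail!("pod {pod_name} was not Running before the deadline")
}
```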

@aluon force-pushed the aluon/push-lrkyvlwzlkxm branch from 3f50267 to 3d76c7f on November 8, 2024 at 18:11
@aluon force-pushed the aluon/push-lrkyvlwzlkxm branch from 9b60399 to 0fcc4e0 on November 8, 2024 at 18:49
@aluon (Contributor Author) commented Nov 8, 2024

Looks like these tests are still timing out with these changes. I'm going to try increasing the timeout.

@aluon (Contributor Author) commented Nov 14, 2024

This didn't work. We're still seeing delays in pod startup due to reattaching PVCs.

I increased the timeouts in #15244 to get the Forge tests working again. We can explore using node affinities to avoid pods getting moved between nodes.
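
For completeness, the node-affinity idea mentioned above could look roughly like the following with kube-rs. This is only a sketch of one possible direction, not part of this PR; the function name, StatefulSet name, and node name are placeholders. Note that patching the pod template would also trigger a rolling restart of the StatefulSet.

```rust
use k8s_openapi::api::apps::v1::StatefulSet;
use kube::{
    api::{Patch, PatchParams},
    Api, Client,
};
use serde_json::json;

/// Pin a StatefulSet's pods to a specific k8s node via a required node
/// affinity, so a deleted pod is rescheduled onto the same node and its
/// PVC does not need to be reattached elsewhere.
async fn pin_sts_to_node(
    client: Client,
    namespace: &str,
    sts_name: &str,
    node_name: &str,
) -> anyhow::Result<()> {
    // Merge-patch the pod template's affinity to require the given node.
    let patch = json!({
        "spec": { "template": { "spec": { "affinity": { "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        "key": "kubernetes.io/hostname",
                        "operator": "In",
                        "values": [node_name]
                    }]
                }]
            }
        }}}}}
    });

    let sts: Api<StatefulSet> = Api::namespaced(client, namespace);
    sts.patch(sts_name, &PatchParams::default(), &Patch::Merge(&patch))
        .await?;
    Ok(())
}
```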

@aluon closed this Nov 14, 2024
Labels
CICD:build-images when this label is present github actions will start build+push rust images from the PR.