
delete pod to stop a node instead of scaling sts #15214

Closed
aluon wants to merge 2 commits from the aluon/push-lrkyvlwzlkxm branch

Conversation

@aluon (Contributor) commented Nov 6, 2024

Description

Add a method to temporarily stop a fullnode/validator node by repeatedly deleting its pod. This is used for the fullnode/validator stress tests, where we want to keep the underlying node allocated to the StatefulSet. We suspect that the previous method of scaling the StatefulSet was causing node allocation delays that made these tests time out.
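
The method's body isn't visible in this view; as a rough illustration of the approach described above, here is a minimal sketch using kube-rs directly. The function name, polling interval, and error handling are hypothetical and not the actual Forge API. The point, per the description, is that the StatefulSet's replica count never changes, so the underlying node stays allocated to it.

```rust
use std::time::{Duration, Instant};

use k8s_openapi::api::core::v1::Pod;
use kube::{api::DeleteParams, Api, Client};

/// Keep a node down for `duration` by repeatedly deleting its pod whenever
/// the StatefulSet controller recreates it. The StatefulSet itself is never
/// scaled, so the underlying k8s node stays allocated to it.
async fn stop_node_by_deleting_pod(
    client: Client,
    namespace: &str,
    pod_name: &str,
    duration: Duration,
) -> anyhow::Result<()> {
    let pods: Api<Pod> = Api::namespaced(client, namespace);
    let deadline = Instant::now() + duration;

    while Instant::now() < deadline {
        // Ignore errors here: the pod may already be gone from a previous
        // deletion and not yet recreated by the controller.
        if let Err(e) = pods.delete(pod_name, &DeleteParams::default()).await {
            eprintln!("delete of {pod_name} failed (may already be gone): {e}");
        }
        // Poll again shortly; if the controller has recreated the pod in the
        // meantime, the next iteration deletes it again.
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
    Ok(())
}
```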

How Has This Been Tested?

Ran the adhoc Forge workflow.

Key Areas to Review

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

trunk-io bot commented Nov 6, 2024

⏱️ 1h 42m total CI duration on this PR
Slowest 15 Jobs              Cumulative Duration   Recent Runs
test-target-determinator     29m                   🟩🟩🟩🟩 (+2 more)
adhoc-forge-test / forge     13m                   🟩
adhoc-forge-test / forge     13m                   🟥
rust-cargo-deny              11m                   🟩🟩🟩🟩🟥 (+2 more)
check-dynamic-deps           10m                   🟩🟩🟩🟩🟩 (+3 more)
semgrep/ci                   4m                    🟩🟩🟩🟩🟩 (+3 more)
rust-move-tests              3m                    🟩
general-lints                3m                    🟩🟩🟩🟩🟩 (+2 more)
rust-move-tests              2m                    🟩
rust-move-tests              2m                    🟩
rust-move-tests              2m                    🟩
rust-move-tests              2m                    🟩
rust-move-tests              2m                    🟩
rust-move-tests              2m                    🟩
file_change_determinator     1m                    🟩🟩🟩🟩🟩 (+2 more)


@aluon added the CICD:build-images label Nov 6, 2024
@aluon force-pushed the aluon/push-lrkyvlwzlkxm branch 3 times, most recently from 89d64d3 to 3f50267 on November 6, 2024 at 20:25
@aluon requested review from ibalajiarun, perryjrandall, vusirikala and a team on November 6, 2024 at 20:41
@aluon marked this pull request as ready for review on November 6, 2024 at 20:41

A review thread on this hunk from the diff:

// Keep deleting the pod if it recovers before the deadline
while Instant::now() < deadline {
    match self.wait_until_healthy(deadline).await {
Reviewer (Contributor):

Waiting until healthy will make the node catch up to the latest state, which is something we don't want to do. Can we wait until the pod is running instead?

aluon (Contributor Author):

Doesn't that match the existing behavior? These tests were using Node::start(), which calls wait_until_healthy().

I updated this to check the pod status instead.

Reviewer (Contributor):

No, before we had stop() -> sleep -> start() -> healthy. Now it's kill() -> healthy() -> kill() -> healthy() -> ..., right?
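
The resolution was to wait for the pod itself rather than for node health. A minimal sketch of such a check with kube-rs, again with hypothetical names and not the actual change in this PR:

```rust
use std::time::{Duration, Instant};

use k8s_openapi::api::core::v1::Pod;
use kube::{Api, Client};

/// Wait only until the recreated pod reports phase "Running", without
/// waiting for the node itself to become healthy (i.e. without letting it
/// catch up to the latest state).
async fn wait_until_pod_running(
    client: Client,
    namespace: &str,
    pod_name: &str,
    deadline: Instant,
) -> anyhow::Result<()> {
    let pods: Api<Pod> = Api::namespaced(client, namespace);
    while Instant::now() < deadline {
        if let Ok(pod) = pods.get(pod_name).await {
            // Pod phase is a plain string in the k8s API ("Pending",
            // "Running", "Succeeded", "Failed", "Unknown").
            let phase = pod.status.and_then(|s| s.phase).unwrap_or_default();
            if phase == "Running" {
                return Ok(());
            }
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
    anyhow::bail!("pod {pod_name} was not Running before the deadline")
}
```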

@aluon force-pushed the aluon/push-lrkyvlwzlkxm branch from 3f50267 to 3d76c7f on November 8, 2024 at 18:11
@aluon force-pushed the aluon/push-lrkyvlwzlkxm branch from 9b60399 to 0fcc4e0 on November 8, 2024 at 18:49
@aluon (Contributor Author) commented Nov 8, 2024

Looks like these tests are still timing out with these changes. I'm going to try increasing the timeout.

@aluon (Contributor Author) commented Nov 14, 2024

This didn't work. We're still seeing delays in pod startup due to reattaching PVCs.

I increased the timeouts in #15244 to get the Forge tests working again. We can explore using node affinities to avoid pods getting moved between nodes.
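
For completeness, the node-affinity idea mentioned above could look roughly like the following with kube-rs. This is only a sketch of one possible direction, not part of this PR; the function name, StatefulSet name, and node name are placeholders. Note that patching the pod template would also trigger a rolling restart of the StatefulSet.

```rust
use k8s_openapi::api::apps::v1::StatefulSet;
use kube::{
    api::{Patch, PatchParams},
    Api, Client,
};
use serde_json::json;

/// Pin a StatefulSet's pods to a specific k8s node via a required node
/// affinity, so a deleted pod is rescheduled onto the same node and its
/// PVC does not need to be reattached elsewhere.
async fn pin_sts_to_node(
    client: Client,
    namespace: &str,
    sts_name: &str,
    node_name: &str,
) -> anyhow::Result<()> {
    // Merge-patch the pod template's affinity to require the given node.
    let patch = json!({
        "spec": { "template": { "spec": { "affinity": { "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        "key": "kubernetes.io/hostname",
                        "operator": "In",
                        "values": [node_name]
                    }]
                }]
            }
        }}}}}
    });

    let sts: Api<StatefulSet> = Api::namespaced(client, namespace);
    sts.patch(sts_name, &PatchParams::default(), &Patch::Merge(&patch))
        .await?;
    Ok(())
}
```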

@aluon closed this Nov 14, 2024
Labels
CICD:build-images when this label is present github actions will start build+push rust images from the PR.