Update RPC chart to use additional zero downtime mechanisms #84

sreuland · 2024-04-18T20:53:32Z

Update the rpc chart with StatefulSet changes that improve zero-downtime behavior of rpc at runtime and during upgrades:

- default the sts to replicas=2. This helps ensure zero downtime during upgrades, because the sts controller can always keep one replica pod up while it updates the other. Updating a pod requires a restart, which also triggers a captive core catchup/sync, and the instance does not pass readiness for the potentially significant time that startup takes.
- leverage the new rpc health status info so that the readiness probe check also verifies that the retention window is fully populated.
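For illustration only, the retention-window condition in the second item could be sketched roughly as below. The field names (`status`, `latestLedger`, `oldestLedger`, `ledgerRetentionWindow`) and the inclusive-range comparison are assumptions about the health response, and `retention_window_is_full` is a hypothetical helper, not the probe command the chart actually runs.

```python
from typing import Any, Mapping


def retention_window_is_full(health: Mapping[str, Any]) -> bool:
    """Return True when the node reports healthy and its stored ledger
    range spans at least the configured retention window.

    All field names are assumed for this sketch.
    """
    # A node that does not report healthy never passes the check.
    if health.get("status") != "healthy":
        return False
    try:
        latest = int(health["latestLedger"])
        oldest = int(health["oldestLedger"])
        window = int(health["ledgerRetentionWindow"])
    except (KeyError, TypeError, ValueError):
        # Missing or malformed fields are treated as "not ready".
        return False
    # Number of ledgers currently stored, counting both endpoints.
    return latest - oldest + 1 >= window
```

A readiness probe built on a check like this would keep a freshly restarted pod out of service until it has both caught up and refilled its history, which is why the second replica matters during upgrades.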

These changes have been verified on cluster deployments by applying the same changes directly through k8s manifests:
https://github.com/stellar/kube/pull/2098

Closes #82


jacekn

LGTM

2opremio · 2024-04-19T09:19:11Z

I would parameterize the readiness probe with respect to having the history window full. When you start rpc for the first time (and for tests) it’s likely that you don’t want to wait for a day. In fact I would make it optional.
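As a rough sketch of what making the check optional could look like, the flag below switches the retention-window gate on or off. The name `require_full_window`, the health field names, and the window size of 17280 ledgers in the usage example are all hypothetical and chosen only for illustration; nothing here reflects an existing chart value.

```python
from typing import Any, Mapping


def is_ready(health: Mapping[str, Any], require_full_window: bool = True) -> bool:
    """Readiness decision with an optional retention-window gate.

    Field names and the flag name are assumptions made for this sketch.
    """
    # A node that does not report healthy is never ready.
    if health.get("status") != "healthy":
        return False
    # With the gate disabled (first start, tests), a healthy status is enough.
    if not require_full_window:
        return True
    # Otherwise require the stored ledger span to cover the configured window.
    stored = health["latestLedger"] - health["oldestLedger"] + 1
    return stored >= health["ledgerRetentionWindow"]


# A freshly started node: healthy, but with only 50 ledgers of history.
fresh = {
    "status": "healthy",
    "latestLedger": 50,
    "oldestLedger": 1,
    "ledgerRetentionWindow": 17280,  # illustrative window size only
}
```

With the gate disabled, `is_ready(fresh, require_full_window=False)` passes immediately, while the default call keeps the node unready until its history fills.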

jacekn · 2024-04-19T09:40:57Z

> I would parameterize the readiness probe with respect to having the history window full. When you start rpc for the first time (and for tests) it’s likely that you don’t want to wait for a day. In fact I would make it optional.

We agreed to improve the way health status is exposed by soroban-rpc, so I think it would be best to do that before we add any options to the chart. This way we avoid potential option renames and extra operational overhead for operators.
Rewiring the probe internally in the chart is transparent and most operators don't need to know about it, so I think this is OK. But once you start adding options, you have to start thinking about their lifecycle, which adds overhead, so I think it should only be done after we make the changes to soroban-rpc.

2opremio · 2024-04-19T11:21:24Z

See stellar/stellar-rpc#146

sreuland · 2024-04-19T18:37:27Z

Good points on both sides. stellar/stellar-rpc#146 seems like it may be too soon; can we agree to wait for the updated rpc health status feature spec to be completed first? I've spun up a feature ticket here, stellar/stellar-rpc#148, please review/comment.

Can this PR proceed with the chart as-is, per @jacekn's feedback on avoiding config churn for operators? Otherwise, adding the external value config param right here, per @2opremio's suggestion, sounds like it would be a good safety valve.

#82: added readiness probe usage of new rpc health info and replicas=2 for zero downtime during upgrade

2e809ac

sreuland mentioned this pull request Apr 18, 2024

enable zero-downtime deployments for RPC #82

Closed

sreuland requested review from jacekn and 2opremio April 18, 2024 20:57

mollykarcher approved these changes Apr 18, 2024


jacekn approved these changes Apr 19, 2024


sreuland merged commit 480e62d into main Apr 22, 2024
