Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update RPC chart to use additional zero downtime mechanisms #84

Merged
merged 1 commit into from
Apr 22, 2024

Conversation

sreuland
Copy link
Contributor

@sreuland sreuland commented Apr 18, 2024

Update the rpc chart to include an updates of the statefulset that assist for better zero downtime of rpc at runtime and upgrade time:

  • default the sts to replicas=2, this will help ensure zero downtime during upgrades, as the sts controller can always keep one replica pod up while it updates the other, as updating a pod requires restart, which also incurs a captive core catchup/sync which can be significant time that the instance is not passing readiness during that startup time.
  • leverage new rpc health status info to verify retention window fully populated condition into the readiness probe check

These changes have been verified on cluster deployments using the same changes directly through k8s manifests:
https://github.com/stellar/kube/pull/2098

Closes #82

Copy link
Contributor

@jacekn jacekn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@2opremio
Copy link
Contributor

I would parameterize the readiness probe with respect to having the history window full. When you start rpc for the first time (and for tests) it’s likely that you don’t want to wait for a day. In fact I would make it optional.

@jacekn
Copy link
Contributor

jacekn commented Apr 19, 2024

I would parameterize the readiness probe with respect to having the history window full. When you start rpc for the first time (and for tests) it’s likely that you don’t want to wait for a day. In fact I would make it optional.

We agreed to improve the way health status is exposed by soroban-rpc so I think it would be best to do that before we add any options to the chart. This way we avoid potential option renames and extra operational overhead for operators.
Rewiring the probe internally in the chart is transparent, most operators don't need to know about it, so I think this is OK. But once you start adding options you have to start thinking about lifecycle for those and this will add overhead so I think it should only be done after we make changes to soroban-rpc

@2opremio
Copy link
Contributor

See stellar/stellar-rpc#146

@sreuland
Copy link
Contributor Author

good points on both sides, stellar/stellar-rpc#146 seems like may be too soon, can we agree to wait for the updated rpc health status feature spec to be completed first? I've spun up feature ticket here - stellar/stellar-rpc#148, please review/comment.

and can this pr proceed forward with chart as-is, per @jacekn 's feedback on avoiding config churn for operators? otherwise, adding the external value config param per @2opremio 's suggestion right here sounds like would be a good safety valve.

@sreuland sreuland merged commit 480e62d into main Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

enable zero-downtime deployments for RPC
4 participants