enable zero-downtime deployments for RPC #82

mollykarcher · 2024-04-10T17:09:58Z

What problem does your feature solve?

In it's current form, RPC takes ~30 minutes to deploy new versions to pubnet (thread1, thread2) due to iops limits when initializing it's in-memory data storage from disk.

What would you like to see?

A new RPC version rolls out, and there's no disruption in service. There is also no loss of historical transaction/events history upon rollout (that is, the db/history does not reset to nothing).

What alternatives are there?

Blue/green deployment model. We'd maintain 2 instances of RPC, with one always being kept in "standby" mode and not used for client requests
Horizontally scale RPC to 2 replicas, each using their own independent PVC, and load balance between them. On deployments, we would deploy to one at a time, making sure we always had 1 ready/available. This is the strategy that we think is better/optimal

sreuland · 2024-04-10T17:26:15Z

I think both options converge to option#2 as a blue/green, with two deployments on cluster one for each color, as it is not possible to update a single replica(pod) within one deployment that has replicas set to more than 1, i.e. all replicas(pods) will inherit the config set on the deployment(defined as pod spec in the deployment spec), this is maintained by the Deployment controller which runs on cluster and constantly monitors deployment pod states to make sure they equal the deployment spec and match deployment replicas count.

sreuland · 2024-04-10T20:03:26Z

we discussed this more in platform team meeting, and thanks @mollykarcher for wrangling ideas further on chat, your summarized option 'magic bullet' approach of using existing StatefuleSet with replicas=2 sounds like viable option to achieve zero down time during upgrades. this provisions for one ordinal pod to always be healthy during upgrade and routable(included as Endpoint) on the k8s Service associated to the StatefuleSet.

So, we should test replicas=2 out in dev to determine if we can land on that to resolve here. One potential caveat from having this horizontally scaled model when both replicas are healthy and routed to service, there may be potential for each instance to be slightly off on their ingested ledger/network states, potentially reporting different responses for same url requests at about the same time. We'd have to see how this looks at run time to see if significant.

Another interesting option if we want to explore a blue/green or canary approach further is with statefulset rollingupdate partitioning which seems to provide a basis for either of those.

mollykarcher · 2024-04-10T21:46:24Z

...there may be potential for each instance to be slightly off on their ingested ledger/network states, potentially reporting different responses for same url requests at about the same time

I agree that this possibility exists, but let's not over-optimize before we know we have a problem. For now, we might want to just monitor and/or alert on any persistent diff in the LCL between the two instances. Could give us a sense of how likely this issue is.

We could also probably delay/lessen the effects of this simply by enabling sticky sessions/session affinity on the rpc ingress.

sreuland · 2024-04-12T22:20:28Z

results with replicas=2 on dev:

observed the rollout behavior on k8s after statefulset with an existing single replica was updated to replicas=2, it followed the sts scaling docs, i.e. the existing '0' ordinal pod is left as-is and it creates '1' ordinal pod, no service down time occurs in this scenario.
- However, the first time rpc is scaled to two replicas, it does allow for the new 2nd pod to start responding to url requests through service before it has a full 24 hour data window populated, so it will produce inconsistent results on json-rpc requests until the 24 hour window has passed. This situation can be fixed by changing the readiness probe to parse and compare the new ledger range info recently included on getHealth response per Add ledger range to getHealth endpoint stellar-rpc#133, should include this fix as part of this ticket.
tested ensuing upgrades when already on 2 healthy replicas, and saw the expected rolling upgrade behavior, the statefulset will update the pods serially in reverse order, during which no http server down time through the service was seen.

…2 for zero downtime during upgrade

sreuland · 2024-04-18T20:57:08Z

two thirds of this effort are complete:
the k8s resource changes are done on dev cluster here:
https://github.com/stellar/kube/pull/2098

the helm-chart update to include the changes:
#84

last step will be to merge same change to dev cluster when soroban rpc 21.0.0 is GA.

mollykarcher added the objective-2 label Apr 10, 2024

mollykarcher added this to the platform sprint 45 milestone Apr 10, 2024

mollykarcher assigned 2opremio Apr 10, 2024

mollykarcher added this to Platform Scrum Apr 10, 2024

github-project-automation bot moved this to Backlog in Platform Scrum Apr 10, 2024

mollykarcher moved this from Backlog to To Do in Platform Scrum Apr 10, 2024

mollykarcher assigned sreuland Apr 10, 2024

sreuland moved this from To Do to In Progress in Platform Scrum Apr 10, 2024

sreuland mentioned this issue Apr 10, 2024

Stabilize RPC rolling update behavior on k8s #79

Closed

mollykarcher unassigned 2opremio Apr 16, 2024

sreuland mentioned this issue Apr 16, 2024

added jq to rpc image for readiness probe usage stellar/stellar-rpc#143

Merged

sreuland moved this from In Progress to Needs Review in Platform Scrum Apr 18, 2024

sreuland added a commit that referenced this issue Apr 18, 2024

#82: added readiness probe usage of new rpc health info and replicas=…

2e809ac

…2 for zero downtime during upgrade

sreuland mentioned this issue Apr 18, 2024

Update RPC chart to use additional zero downtime mechanisms #84

Merged

sreuland closed this as completed in #84 Apr 22, 2024

github-project-automation bot moved this from Needs Review to Done in Platform Scrum Apr 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

enable zero-downtime deployments for RPC #82

enable zero-downtime deployments for RPC #82

mollykarcher commented Apr 10, 2024

sreuland commented Apr 10, 2024 •

edited

Loading

sreuland commented Apr 10, 2024 •

edited by mollykarcher

Loading

mollykarcher commented Apr 10, 2024

sreuland commented Apr 12, 2024

sreuland commented Apr 18, 2024

enable zero-downtime deployments for RPC #82

enable zero-downtime deployments for RPC #82

Comments

mollykarcher commented Apr 10, 2024

What problem does your feature solve?

What would you like to see?

What alternatives are there?

sreuland commented Apr 10, 2024 • edited Loading

sreuland commented Apr 10, 2024 • edited by mollykarcher Loading

mollykarcher commented Apr 10, 2024

sreuland commented Apr 12, 2024

sreuland commented Apr 18, 2024

sreuland commented Apr 10, 2024 •

edited

Loading

sreuland commented Apr 10, 2024 •

edited by mollykarcher

Loading