Enhance RPC Health Status #148

sreuland · 2024-04-19T18:28:10Z

What problem does your feature solve?

RPC's getHealth response doesn't represent the actual runtime status levels that RPC can be in.
RPC's getHealth requires a POST with a JSON-RPC request payload, and the client must then parse the JSON response to determine status, which puts more work on the operator side to support health checks. In general, health endpoints are exposed as HTTP GET endpoints so that the operator only needs to check the HTTP response code.

What would you like to see?

Note: these requirements were drawn from the design discussion on https://github.com/stellar/kube/pull/2098#pullrequestreview-2005913742

RPC provides a /health/{qos_level} endpoint that is retrievable via HTTP GET.
RPC starts the HTTP service that publishes this endpoint immediately after the RPC process starts, with no delayed start of the HTTP service.
RPC responds to an HTTP GET /health/{qos_level} with a 200 or 503 HTTP response code to indicate whether that level is active or not.
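
To make the three requirements above concrete, here is a minimal sketch of such an endpoint in Python (not the RPC codebase's language, just for illustration). The names `CURRENT_LEVEL`, `parse_level`, `status_for` and `make_health_server` are hypothetical, and the rule that a level counts as active once the service has reached it is my assumption, since the issue doesn't spell that out.

```python
import re
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical shared state; a real service would update it from ingestion.
CURRENT_LEVEL = {"value": 1}

HEALTH_PATH = re.compile(r"^/health/(\d+)$")


def parse_level(path):
    """Return the requested QoS level from the URL path, or None if no match."""
    match = HEALTH_PATH.match(path)
    return int(match.group(1)) if match else None


def status_for(requested, current):
    """Assumption: a level is 'active' once the service has reached it."""
    return 200 if current >= requested else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        requested = parse_level(self.path)
        if requested is None:
            self.send_error(404)
            return
        # No body: the status code alone carries the answer.
        self.send_response(status_for(requested, CURRENT_LEVEL["value"]))
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep probe traffic out of stderr


def make_health_server(host="127.0.0.1", port=8000):
    """Bind the listener at construction time, before any ingestion work."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Calling `make_health_server().serve_forever()` first thing at process start would cover the "no delayed start" point: probes get a 503 until the level is reached, rather than a connection refused.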

The new endpoint supports a notion of QoS levels for representing the different potential run time states that RPC can be in:

level 1 - service is completely unhealthy: the process is running but ingestion from the network isn't stable yet, and it is unable to process requests.
level 2 - service is running and forward ingestion from the network is happening; the data retention window is not fully caught up yet, but some JSON-RPC request endpoints can be processed.
level 3 - service is running, forward ingestion from the network is happening, the data retention window is full, and all RPC request endpoints are up.
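
As a rough sketch of how the three levels could be derived from runtime state: the inputs `ingestion_stable`, `ledgers_held` and `retention_window`, and the names `QosLevel` and `classify`, are all hypothetical and not taken from the RPC codebase.

```python
from enum import IntEnum


class QosLevel(IntEnum):
    UNHEALTHY = 1  # process up, ingestion not stable yet
    PARTIAL = 2    # ingesting, retention window still filling
    FULL = 3       # ingesting, retention window full


def classify(ingestion_stable: bool, ledgers_held: int, retention_window: int) -> QosLevel:
    """Map runtime state onto the three levels described above."""
    if not ingestion_stable:
        return QosLevel.UNHEALTHY
    if ledgers_held < retention_window:
        return QosLevel.PARTIAL
    return QosLevel.FULL
```

Using an `IntEnum` keeps the levels ordered, so a check like `classify(...) >= QosLevel.PARTIAL` reads naturally.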

What alternatives are there?

use the current json-rpc getHealth


overcat · 2024-11-07T02:31:36Z

I strongly support this feature. When I was configuring failover for sorobanrpc.com, I had to write an additional simple API service to proxy the getHealth interface, and then have the health checker access this API service. If soroban-rpc supported direct GET access to the getHealth interface, I wouldn't need an extra API service.

(I'm unsure how many health checkers support posting JSON body during their health checks.)
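
For anyone in the same situation, the core of such a proxy might look roughly like the sketch below. It is not the actual sorobanrpc.com service: `post_get_health` and `proxy_status` are hypothetical names, and mapping anything other than a "healthy" status to 503 is an assumption.

```python
import json
import urllib.request

GET_HEALTH_BODY = json.dumps(
    {"jsonrpc": "2.0", "id": 1, "method": "getHealth"}
).encode()


def post_get_health(rpc_url, timeout=2.0):
    """POST a JSON-RPC getHealth request and return the decoded reply."""
    req = urllib.request.Request(
        rpc_url,
        data=GET_HEALTH_BODY,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def proxy_status(rpc_url, fetch=post_get_health):
    """Collapse the JSON-RPC reply into the status code a GET probe wants."""
    try:
        reply = fetch(rpc_url)
    except (OSError, ValueError):
        # Network failures and undecodable bodies both count as unhealthy.
        return 503
    result = reply.get("result") if isinstance(reply, dict) else None
    healthy = isinstance(result, dict) and result.get("status") == "healthy"
    return 200 if healthy else 503
```

A GET handler in front of the RPC node would then just return `proxy_status(rpc_url)` as its response code. The `fetch` parameter is only there so the logic can be exercised without a live node.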

mollykarcher · 2024-11-19T17:31:38Z

After @overcat's comment, I'm realizing that the link referenced in the issue description is to an internal repository, which defines the k8s manifests for SDF's service deployments. So it's not broadly public what we do, nor do we publish any recommendations about this anywhere in our docs (which we should also fix!). We currently deploy it via k8s, and define the readinessProbe as follows:

readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        curl -s --location --request POST 'http://127.0.0.1:8000/' \
          --header 'Content-Type: application/json' \
          --data-raw '{
            "jsonrpc": "2.0",
            "id": 10235,
            "method": "getHealth"
          }' | jq -es 'if (. | length) == 0 then null else .[0] end | .result | .status == "healthy" and (.latestLedger - .oldestLedger >= (.ledgerRetentionWindow - 10))' > /dev/null;
  failureThreshold: 1
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2

So we are effectively parsing out the ledger range in order to determine health of the instance.
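
For readers who don't read jq fluently, the same check can be written out as a small Python function. This is only a restatement of the filter above, not something we deploy; `is_ready` and its `slack` parameter (the `10` in the jq expression) are names made up for the sketch.

```python
def is_ready(reply, slack=10):
    """Mirror the jq filter: healthy status and a nearly full ledger range."""
    result = (reply or {}).get("result") or {}
    if result.get("status") != "healthy":
        return False
    try:
        held = result["latestLedger"] - result["oldestLedger"]
        return held >= result["ledgerRetentionWindow"] - slack
    except (KeyError, TypeError):
        # Missing or non-numeric fields fail the probe, as a jq error would.
        return False
```

With a retention window of 120960 ledgers, for example, the instance is reported ready once it holds a range of at least 120950 ledgers.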

sreuland · 2024-12-09T19:52:41Z

Additional scope to consider for the QoS readiness levels: capturing 'live' simulation and 'proof' simulation capability.

sreuland added the rpc-sdk-scrum label Apr 19, 2024

sreuland added this to Platform Scrum Apr 19, 2024

github-project-automation bot moved this to Backlog in Platform Scrum Apr 19, 2024

sreuland mentioned this issue Apr 19, 2024

Update RPC chart to use additional zero downtime mechanisms stellar/helm-charts#84

Merged

janewang added objective-5 and removed rpc-sdk-scrum labels Apr 23, 2024

sreuland mentioned this issue Jun 3, 2024

Add http endpoints for rpc functions #188

Open

mollykarcher added rpc-sdk-scrum and removed objective-5 labels Jun 25, 2024

mollykarcher added this to the platform sprint 54 milestone Nov 19, 2024

mollykarcher moved this from Backlog to To Do in Platform Scrum Nov 19, 2024

2opremio modified the milestones: platform sprint 54, Platform Sprint 55 Dec 3, 2024

Shaptic modified the milestones: platform sprint 55, platform sprint 54 Dec 3, 2024
