From 8029b03ec77949dbaa02a98056ac2f90e31d4ee4 Mon Sep 17 00:00:00 2001 From: Alex Leong Date: Mon, 21 Oct 2024 14:12:52 -0700 Subject: [PATCH 1/2] Update docs to use stat-inbound and stat-outbound (#1833) * Copy 2.16 docs to 2-edge Signed-off-by: Alex Leong * update docs to use stat-inbound and stat-outbound Signed-off-by: Alex Leong * Update authorization-policy.md This should be all lowercase. --------- Signed-off-by: Alex Leong Co-authored-by: Flynn --- .../content/2-edge/features/telemetry.md | 12 +- linkerd.io/content/2-edge/tasks/books.md | 193 ++++++++---------- .../content/2-edge/tasks/fault-injection.md | 40 ++-- .../content/2-edge/tasks/multicluster.md | 4 +- 4 files changed, 113 insertions(+), 136 deletions(-) diff --git a/linkerd.io/content/2-edge/features/telemetry.md b/linkerd.io/content/2-edge/features/telemetry.md index d5f8a9fef8..023fdfe5dd 100644 --- a/linkerd.io/content/2-edge/features/telemetry.md +++ b/linkerd.io/content/2-edge/features/telemetry.md @@ -34,8 +34,8 @@ requiring any work on the part of the developer. These features include: This data can be consumed in several ways: -* Through the [Linkerd CLI](../../reference/cli/), e.g. with `linkerd viz stat` and - `linkerd viz routes`. +* Through the [Linkerd CLI](../../reference/cli/), e.g. with `linkerd viz stat-inbound` + and `linkerd viz stat-outbound`. * Through the [Linkerd dashboard](../dashboard/), and [pre-built Grafana dashboards](../../tasks/grafana/). * Directly from Linkerd's built-in Prometheus instance @@ -47,8 +47,8 @@ This data can be consumed in several ways: This is the percentage of successful requests during a time window (1 minute by default). -In the output of the command `linkerd viz routes -o wide`, this metric is split -into EFFECTIVE_SUCCESS and ACTUAL_SUCCESS. For routes configured with retries, +In the output of the command `linkerd viz stat-outbound`, this metric is shown +for routes and for individual backends. For routes configured with retries, the former calculates the percentage of success after retries (as perceived by the client-side), and the latter before retries (which can expose potential problems with the service). @@ -56,8 +56,8 @@ problems with the service). ### Traffic (Requests Per Second) This gives an overview of how much demand is placed on the service/route. As -with success rates, `linkerd viz routes -o wide` splits this metric into -EFFECTIVE_RPS and ACTUAL_RPS, corresponding to rates after and before retries +with success rates, `linkerd viz stat-outbound` splits this metric into +route level and backend level, corresponding to rates after and before retries respectively. ### Latencies diff --git a/linkerd.io/content/2-edge/tasks/books.md b/linkerd.io/content/2-edge/tasks/books.md index 26e832e4a0..5a4b9eeee2 100644 --- a/linkerd.io/content/2-edge/tasks/books.md +++ b/linkerd.io/content/2-edge/tasks/books.md @@ -104,53 +104,43 @@ more details on how this works.) ## Debugging -Let's use Linkerd to discover the root cause of this app's failures. Linkerd's -proxy exposes rich metrics about the traffic that it processes, including HTTP -response codes. The metric that we're interested is `outbound_http_route_backend_response_statuses_total` -and will help us identify where HTTP errors are occuring. We can use the -`linkerd diagnostics proxy-metrics` command to get proxy metrics. Pick one of -your webapp pods and run the following command to get the metrics for HTTP 500 -responses: +Let's use Linkerd to discover the root cause of this app's failures. 
We can use +the `stat-inbound` command to see the success rate of the webapp deployment: ```bash -linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \ -| grep outbound_http_route_backend_response_statuses_total \ -| grep http_status=\"500\" +linkerd viz -n booksapp stat-inbound deploy/webapp +NAME SERVER ROUTE TYPE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 +webapp [default]:4191 [default] 100.00% 0.30 4ms 9ms 10ms +webapp [default]:4191 probe 100.00% 0.60 0ms 1ms 1ms +webapp [default]:7000 probe 100.00% 0.30 2ms 2ms 2ms +webapp [default]:7000 [default] 75.66% 8.22 18ms 65ms 93ms ``` -This should return a metric that looks something like: - -```text -outbound_http_route_backend_response_statuses_total{ - parent_group="core", - parent_kind="Service", - parent_namespace="booksapp", - parent_name="books", - parent_port="7002", - parent_section_name="", - route_group="", - route_kind="default", - route_namespace="", - route_name="http", - backend_group="core", - backend_kind="Service", - backend_namespace="booksapp", - backend_name="books", - backend_port="7002", - backend_section_name="", - http_status="500", - error="" -} 207 +This shows us inbound traffic statistics. In other words, we see that the webapp +is receiving 8.22 requests per second on port 7000 and that only 75.66% of those +requests are successful. + +To dig into this further and find the root cause, we can look at the webapp's +outbound traffic. This will tell us about the requests that the webapp makes to +other services. + +```bash +linkerd viz -n booksapp stat-outbound deploy/webapp +NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES +webapp books:7002 [default] 77.36% 7.95 25ms 48ms 176ms 0.00% 0.00% + └──────────────────► books:7002 77.36% 7.95 15ms 44ms 64ms 0.00% +webapp authors:7001 [default] 100.00% 3.53 26ms 72ms 415ms 0.00% 0.00% + └──────────────────► authors:7001 100.00% 3.53 16ms 52ms 91ms 0.00% ``` -This counter tells us that the webapp pod received a total of 207 HTTP 500 -responses from the `books` Service on port 7002. +We see that webapp sends traffic to both the books service and the authors +service and that the problem seems to be with the traffic to the books service. ## HTTPRoute -We know that the webapp component is getting 500s from the books component, but -it would be great to narrow this down further and get per route metrics. To do -this, we take advantage of the Gateway API and define a set of HTTPRoute +We know that the webapp component is getting failures from the books component, +but it would be great to narrow this down further and get per route metrics. To +do this, we take advantage of the Gateway API and define a set of HTTPRoute resources, each attached to the `books` Service by specifying it as their `parent_ref`. @@ -239,36 +229,19 @@ Notice that the `Accepted` and `ResolvedRefs` conditions are `True`. [...] 
``` -With those HTTPRoutes in place, we can look at the `outbound_http_route_backend_response_statuses_total` -metric again, and see that the route labels have been populated: +With those HTTPRoutes in place, we can look at the outbound stats again: ```bash -linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \ -| grep outbound_http_route_backend_response_statuses_total \ -| grep http_status=\"500\" -``` - -```text -outbound_http_route_backend_response_statuses_total{ - parent_group="core", - parent_kind="Service", - parent_namespace="booksapp", - parent_name="books", - parent_port="7002", - parent_section_name="", - route_group="gateway.networking.k8s.io", - route_kind="HTTPRoute", - route_namespace="booksapp", - route_name="books-create", - backend_group="core", - backend_kind="Service", - backend_namespace="booksapp", - backend_name="books", - backend_port="7002", - backend_section_name="", - http_status="500", - error="" -} 212 +linkerd viz -n booksapp stat-outbound deploy/webapp +NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES +webapp authors:7001 [default] 100.00% 2.80 25ms 48ms 50ms 0.00% 0.00% + └─────────────────────► authors:7001 100.00% 2.80 16ms 45ms 49ms 0.00% +webapp books:7002 books-list HTTPRoute 100.00% 1.43 25ms 48ms 50ms 0.00% 0.00% + └─────────────────────► books:7002 100.00% 1.43 12ms 24ms 25ms 0.00% +webapp books:7002 books-create HTTPRoute 54.27% 2.73 27ms 207ms 441ms 0.00% 0.00% + └─────────────────────► books:7002 54.27% 2.73 14ms 152ms 230ms 0.00% +webapp books:7002 books-delete HTTPRoute 100.00% 0.72 25ms 48ms 50ms 0.00% 0.00% + └─────────────────────► books:7002 100.00% 0.72 12ms 24ms 25ms 0.00% ``` This tells us that it is requests to the `books-create` HTTPRoute which have @@ -287,27 +260,30 @@ kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \ retry.linkerd.io/http=5xx ``` -We can then see the effect of these retries by looking at Linkerd's retry -metrics: +We can then see the effect of these retries: ```bash -linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \ -| grep outbound_http_route_backend_response_statuses_total \ -| grep retry -``` - -```text -outbound_http_route_retry_limit_exceeded_total{...} 222 -outbound_http_route_retry_overflow_total{...} 0 -outbound_http_route_retry_requests_total{...} 469 -outbound_http_route_retry_successes_total{...} 247 +linkerd viz -n booksapp stat-outbound deploy/webapp +NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES +webapp books:7002 books-create HTTPRoute 73.17% 2.05 98ms 460ms 492ms 0.00% 34.22% + └─────────────────────► books:7002 48.13% 3.12 29ms 93ms 99ms 0.00% +webapp books:7002 books-list HTTPRoute 100.00% 1.50 25ms 48ms 49ms 0.00% 0.00% + └─────────────────────► books:7002 100.00% 1.50 12ms 24ms 25ms 0.00% +webapp books:7002 books-delete HTTPRoute 100.00% 0.73 25ms 48ms 50ms 0.00% 0.00% + └─────────────────────► books:7002 100.00% 0.73 12ms 24ms 25ms 0.00% +webapp authors:7001 [default] 100.00% 2.98 25ms 48ms 50ms 0.00% 0.00% + └─────────────────────► authors:7001 100.00% 2.98 16ms 44ms 49ms 0.00% ``` -This tells us that Linkerd made a total of 469 retry requests, of which 247 were -successful. The remaining 222 failed and could not be retried again, since we -didn't raise the retry limit from its default of 1. 
+Notice that while the success rate of individual requests to the books backend +on the `books-create` route only have a success rate of about 50%, the overall +success rate on that route has been raised to 73% due to retries. We can also +see that 34.22% of the requests on this route are retries and that the improved +success rate has come at the expense of additional RPS to the backend and +increased overall latency. -We can improve this further by increasing this limit to allow more than 1 retry +By default, Linkerd will only attempt 1 retry per failure. We can improve +success rate further by increasing this limit to allow more than 1 retry per request: ```bash @@ -315,9 +291,23 @@ kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \ retry.linkerd.io/limit=3 ``` -Over time you will see `outbound_http_route_retry_requests_total` and -`outbound_http_route_retry_successes_total` increase at a much higher rate than -`outbound_http_route_retry_limit_exceeded_total`. +Looking at the stats again: + +```bash +linkerd viz -n booksapp stat-outbound deploy/webapp +NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES +webapp books:7002 books-delete HTTPRoute 100.00% 0.75 25ms 48ms 50ms 0.00% 0.00% + └─────────────────────► books:7002 100.00% 0.75 12ms 24ms 25ms 0.00% +webapp authors:7001 [default] 100.00% 2.92 25ms 48ms 50ms 0.00% 0.00% + └─────────────────────► authors:7001 100.00% 2.92 18ms 46ms 49ms 0.00% +webapp books:7002 books-create HTTPRoute 92.78% 1.62 111ms 461ms 492ms 0.00% 47.28% + └─────────────────────► books:7002 48.91% 3.07 42ms 179ms 236ms 0.00% +webapp books:7002 books-list HTTPRoute 100.00% 1.45 25ms 48ms 50ms 0.00% 0.00% + └─────────────────────► books:7002 100.00% 1.45 12ms 24ms 25ms 0.00% +``` + +We see that these additional retries have increased the overall success rate on +this route to 92.78%. ## Timeouts @@ -337,30 +327,19 @@ getting so many that it's hard to see what's going on!) We can see the effects of this timeout by running: ```bash -linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \ -| grep outbound_http_route_request_statuses_total | grep books-create +linkerd viz -n booksapp stat-outbound deploy/webapp +NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES +webapp authors:7001 [default] 100.00% 2.85 26ms 49ms 370ms 0.00% 0.00% + └─────────────────────► authors:7001 100.00% 2.85 19ms 49ms 86ms 0.00% +webapp books:7002 books-create HTTPRoute 78.90% 1.82 45ms 449ms 490ms 21.10% 47.34% + └─────────────────────► books:7002 41.55% 3.45 24ms 134ms 227ms 11.11% +webapp books:7002 books-list HTTPRoute 100.00% 1.40 25ms 47ms 49ms 0.00% 0.00% + └─────────────────────► books:7002 100.00% 1.40 12ms 24ms 25ms 0.00% +webapp books:7002 books-delete HTTPRoute 100.00% 0.70 25ms 48ms 50ms 0.00% 0.00% + └─────────────────────► books:7002 100.00% 0.70 12ms 24ms 25ms 0.00% ``` -```text -outbound_http_route_request_statuses_total{ - [...] - route_name="books-create", - http_status="", - error="REQUEST_TIMEOUT" -} 151 -outbound_http_route_request_statuses_total{ - [...] - route_name="books-create", - http_status="201", - error="" -} 5548 -outbound_http_route_request_statuses_total{ - [...] - route_name="books-create", - http_status="500", - error="" -} 3194 -``` +We see that 21.10% of the requests are hitting this timeout. 
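
If you want to confirm exactly which retry and timeout policies produced the numbers above, you can read the annotations back off the route. This is a quick sanity-check sketch rather than part of the walkthrough itself: it assumes the `retry.linkerd.io/http` and `retry.linkerd.io/limit` annotations applied in the earlier steps, plus whichever timeout annotation you set at the start of this section.

```bash
# Print the policy annotations currently attached to the books-create HTTPRoute.
# The retry annotations come from the earlier steps in this guide; the timeout
# annotation is whatever value you chose above.
kubectl -n booksapp get httproutes.gateway.networking.k8s.io/books-create \
  -o jsonpath='{.metadata.annotations}{"\n"}'
```

Removing an annotation (for example, to rerun an experiment from scratch) uses the same `kubectl annotate` command with a trailing dash on the key, e.g. `retry.linkerd.io/limit-`.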
## Clean Up diff --git a/linkerd.io/content/2-edge/tasks/fault-injection.md b/linkerd.io/content/2-edge/tasks/fault-injection.md index 2defa3d8a7..45a38b685c 100644 --- a/linkerd.io/content/2-edge/tasks/fault-injection.md +++ b/linkerd.io/content/2-edge/tasks/fault-injection.md @@ -53,17 +53,20 @@ After a little while, the stats will show 100% success rate. You can verify this by running: ```bash -linkerd viz -n booksapp stat deploy +linkerd viz -n booksapp stat-inbound deploy ``` The output will end up looking at little like: ```bash -NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN -authors 1/1 100.00% 7.1rps 4ms 26ms 33ms 6 -books 1/1 100.00% 8.6rps 6ms 73ms 95ms 6 -traffic 1/1 - - - - - - -webapp 3/3 100.00% 7.9rps 20ms 76ms 95ms 9 +NAME SERVER ROUTE TYPE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 +authors [default]:4191 [default] 100.00% 0.20 0ms 1ms 1ms +authors [default]:7001 [default] 100.00% 3.00 2ms 36ms 43ms +books [default]:4191 [default] 100.00% 0.23 4ms 4ms 4ms +books [default]:7002 [default] 100.00% 3.60 2ms 2ms 2ms +traffic [default]:4191 [default] 100.00% 0.22 0ms 3ms 1ms +webapp [default]:4191 [default] 100.00% 0.72 4ms 5ms 1ms +webapp [default]:7000 [default] 100.00% 3.25 2ms 2ms 65ms ``` ## Create the faulty backend @@ -182,25 +185,20 @@ for details. When Linkerd sees traffic going to the `books` service, it will send 9/10 requests to the original service and 1/10 to the error injector. You can see -what this looks like by running `stat` and filtering explicitly to just the -requests from `webapp`: +what this looks like by running `stat-outbound`: ```bash -linkerd viz stat -n booksapp deploy --from deploy/webapp -NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN -authors 1/1 98.15% 4.5rps 3ms 36ms 39ms 3 -books 1/1 100.00% 6.7rps 5ms 27ms 67ms 6 -error-injector 1/1 0.00% 0.7rps 1ms 1ms 1ms 3 +linkerd viz stat-outbound -n booksapp deploy/webapp +NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES +webapp authors:7001 [default] 98.44% 4.28 25ms 47ms 50ms 0.00% 0.00% + └────────────────────► authors:7001 98.44% 4.28 15ms 42ms 48ms 0.00% +webapp books:7002 error-split HTTPRoute 87.76% 7.22 26ms 49ms 333ms 0.00% 0.00% + ├────────────────────► books:7002 100.00% 6.33 14ms 42ms 83ms 0.00% + └────────────────────► error-injector:8080 0.00% 0.88 12ms 24ms 25ms 0.00% ``` -We can also look at the success rate of the `webapp` overall to see the effects -of the error injector. The success rate should be approximately 90%: - -```bash -linkerd viz stat -n booksapp deploy/webapp -NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN -webapp 3/3 88.42% 9.5rps 14ms 37ms 75ms 10 -``` +We can see here that 0.88 requests per second are being sent to the error +injector and that the overall success rate is 87.76%. ## Cleanup diff --git a/linkerd.io/content/2-edge/tasks/multicluster.md b/linkerd.io/content/2-edge/tasks/multicluster.md index 0230ff04e3..7e1e3cfa93 100644 --- a/linkerd.io/content/2-edge/tasks/multicluster.md +++ b/linkerd.io/content/2-edge/tasks/multicluster.md @@ -383,10 +383,10 @@ You'll see the `greeting from east` message! Requests from the `frontend` pod running in `west` are being transparently forwarded to `east`. Assuming that you're still port forwarding from the previous step, you can also reach this with `curl http://localhost:8080/east`. Make that call a couple times and -you'll be able to get metrics from `linkerd viz stat` as well. 
+you'll be able to get metrics from `linkerd viz stat-outbound` as well. ```bash -linkerd --context=west -n test viz stat --from deploy/frontend svc +linkerd --context=west -n test viz stat-outbound deploy/frontend ``` We also provide a grafana dashboard to get a feel for what's going on here (see From db10727dc7f2ef2800788707789037e91045eab2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ivan=20=28=EC=9D=B4=EB=B0=98=29=20Porta?= Date: Tue, 22 Oct 2024 16:49:29 +0200 Subject: [PATCH 2/2] Add missing \ (#1857) --- .../tasks/automatically-rotating-webhook-tls-credentials.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/linkerd.io/content/2-edge/tasks/automatically-rotating-webhook-tls-credentials.md b/linkerd.io/content/2-edge/tasks/automatically-rotating-webhook-tls-credentials.md index db5ab64b06..9e34336437 100644 --- a/linkerd.io/content/2-edge/tasks/automatically-rotating-webhook-tls-credentials.md +++ b/linkerd.io/content/2-edge/tasks/automatically-rotating-webhook-tls-credentials.md @@ -282,7 +282,7 @@ linkerd viz install \ | kubectl apply -f - # ignore if not using the jaeger extension -linkerd jaeger install +linkerd jaeger install \ --set webhook.externalSecret=true \ --set-file webhook.caBundle=ca.crt \ | kubectl apply -f -
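
The single backslash added in the last hunk matters because, without a trailing `\`, the shell stops parsing the command at the end of that line: `linkerd jaeger install` runs with no flags, and the `--set ...` lines are then attempted as a separate command, which fails, leaving nothing for `kubectl apply` to read. A minimal sketch of the corrected pipeline, using the same flags shown in the hunk above (the `ca.crt` path is whatever CA bundle you produced earlier in that guide):

```bash
# One logical command: the trailing backslashes continue the line, so all flags
# reach `linkerd jaeger install`, and its rendered manifests are piped to kubectl.
linkerd jaeger install \
  --set webhook.externalSecret=true \
  --set-file webhook.caBundle=ca.crt \
  | kubectl apply -f -
```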