Merge branch 'main' into patch-46
kflynn authored Oct 22, 2024
2 parents 9ac2ade + db10727 commit 6e1a5f8
Showing 5 changed files with 114 additions and 137 deletions.
12 changes: 6 additions & 6 deletions linkerd.io/content/2-edge/features/telemetry.md
@@ -34,8 +34,8 @@ requiring any work on the part of the developer. These features include:

This data can be consumed in several ways:

* Through the [Linkerd CLI](../../reference/cli/), e.g. with `linkerd viz stat` and
`linkerd viz routes`.
* Through the [Linkerd CLI](../../reference/cli/), e.g. with `linkerd viz stat-inbound`
and `linkerd viz stat-outbound`.
* Through the [Linkerd dashboard](../dashboard/), and
[pre-built Grafana dashboards](../../tasks/grafana/).
* Directly from Linkerd's built-in Prometheus instance
@@ -47,17 +47,17 @@ This data can be consumed in several ways:
This is the percentage of successful requests during a time window (1 minute by
default).

In the output of the command `linkerd viz routes -o wide`, this metric is split
into EFFECTIVE_SUCCESS and ACTUAL_SUCCESS. For routes configured with retries,
In the output of the command `linkerd viz stat-outbound`, this metric is shown
for routes and for individual backends. For routes configured with retries,
the former calculates the percentage of success after retries (as perceived by
the client-side), and the latter before retries (which can expose potential
problems with the service).

### Traffic (Requests Per Second)

This gives an overview of how much demand is placed on the service/route. As
with success rates, `linkerd viz routes -o wide` splits this metric into
EFFECTIVE_RPS and ACTUAL_RPS, corresponding to rates after and before retries
with success rates, `linkerd viz stat-outbound` splits this metric into
route level and backend level, corresponding to rates after and before retries
respectively.
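
For example (the deployment and namespace names here are placeholders, not part
of this guide), both views can be pulled for a single workload with a command
like:

```bash
# Placeholder names: substitute your own namespace and workload.
# Route-level rows reflect rates after retries; the backend rows beneath them
# reflect rates before retries.
linkerd viz -n my-namespace stat-outbound deploy/my-workload
```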

### Latencies
@@ -282,7 +282,7 @@ linkerd viz install \
| kubectl apply -f -

# ignore if not using the jaeger extension
linkerd jaeger install
linkerd jaeger install \
--set webhook.externalSecret=true \
--set-file webhook.caBundle=ca.crt \
| kubectl apply -f -
193 changes: 86 additions & 107 deletions linkerd.io/content/2-edge/tasks/books.md
@@ -104,53 +104,43 @@ more details on how this works.)

## Debugging

Let's use Linkerd to discover the root cause of this app's failures. Linkerd's
proxy exposes rich metrics about the traffic that it processes, including HTTP
response codes. The metric we're interested in is `outbound_http_route_backend_response_statuses_total`,
which will help us identify where HTTP errors are occurring. We can use the
`linkerd diagnostics proxy-metrics` command to get proxy metrics. Pick one of
your webapp pods and run the following command to get the metrics for HTTP 500
responses:
Let's use Linkerd to discover the root cause of this app's failures. We can use
the `stat-inbound` command to see the success rate of the webapp deployment:

```bash
linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
| grep outbound_http_route_backend_response_statuses_total \
| grep http_status=\"500\"
linkerd viz -n booksapp stat-inbound deploy/webapp
NAME SERVER ROUTE TYPE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
webapp [default]:4191 [default] 100.00% 0.30 4ms 9ms 10ms
webapp [default]:4191 probe 100.00% 0.60 0ms 1ms 1ms
webapp [default]:7000 probe 100.00% 0.30 2ms 2ms 2ms
webapp [default]:7000 [default] 75.66% 8.22 18ms 65ms 93ms
```

This should return a metric that looks something like:

```text
outbound_http_route_backend_response_statuses_total{
parent_group="core",
parent_kind="Service",
parent_namespace="booksapp",
parent_name="books",
parent_port="7002",
parent_section_name="",
route_group="",
route_kind="default",
route_namespace="",
route_name="http",
backend_group="core",
backend_kind="Service",
backend_namespace="booksapp",
backend_name="books",
backend_port="7002",
backend_section_name="",
http_status="500",
error=""
} 207
```
This shows us inbound traffic statistics. In other words, we see that the webapp
is receiving 8.22 requests per second on port 7000 and that only 75.66% of those
requests are successful.
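
(The rows for port 4191 correspond to the Linkerd proxy's own admin endpoint,
which serves metrics and health probes; the application traffic we're
interested in here is on port 7000.)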

To dig into this further and find the root cause, we can look at the webapp's
outbound traffic. This will tell us about the requests that the webapp makes to
other services.

```bash
linkerd viz -n booksapp stat-outbound deploy/webapp
NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
webapp books:7002 [default] 77.36% 7.95 25ms 48ms 176ms 0.00% 0.00%
└──────────────────► books:7002 77.36% 7.95 15ms 44ms 64ms 0.00%
webapp authors:7001 [default] 100.00% 3.53 26ms 72ms 415ms 0.00% 0.00%
└──────────────────► authors:7001 100.00% 3.53 16ms 52ms 91ms 0.00%
```

This counter tells us that the webapp pod received a total of 207 HTTP 500
responses from the `books` Service on port 7002.
We see that webapp sends traffic to both the books service and the authors
service and that the problem seems to be with the traffic to the books service.

## HTTPRoute

We know that the webapp component is getting 500s from the books component, but
it would be great to narrow this down further and get per route metrics. To do
this, we take advantage of the Gateway API and define a set of HTTPRoute
We know that the webapp component is getting failures from the books component,
but it would be great to narrow this down further and get per route metrics. To
do this, we take advantage of the Gateway API and define a set of HTTPRoute
resources, each attached to the `books` Service by specifying it as their
`parent_ref`.
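
(The full HTTPRoute manifests are given later in this guide. As a rough,
hedged sketch of the shape of one such resource, a `books-create` route
attached to the `books` Service might look like the following; the path and
method matched here are illustrative assumptions, not the guide's exact
definition.)

```bash
# Illustrative sketch only; the guide's real manifests are authoritative.
# The POST /books.json match below is an assumption about the booksapp API.
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: books-create
  namespace: booksapp
spec:
  parentRefs:
    - name: books
      namespace: booksapp
      kind: Service
      group: core
      port: 7002
  rules:
    - matches:
        - path:
            value: /books.json
          method: POST
EOF
```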

@@ -239,36 +229,19 @@ Notice that the `Accepted` and `ResolvedRefs` conditions are `True`.
[...]
```

With those HTTPRoutes in place, we can look at the `outbound_http_route_backend_response_statuses_total`
metric again, and see that the route labels have been populated:
With those HTTPRoutes in place, we can look at the outbound stats again:

```bash
linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
| grep outbound_http_route_backend_response_statuses_total \
| grep http_status=\"500\"
```

```text
outbound_http_route_backend_response_statuses_total{
parent_group="core",
parent_kind="Service",
parent_namespace="booksapp",
parent_name="books",
parent_port="7002",
parent_section_name="",
route_group="gateway.networking.k8s.io",
route_kind="HTTPRoute",
route_namespace="booksapp",
route_name="books-create",
backend_group="core",
backend_kind="Service",
backend_namespace="booksapp",
backend_name="books",
backend_port="7002",
backend_section_name="",
http_status="500",
error=""
} 212
```

```bash
linkerd viz -n booksapp stat-outbound deploy/webapp
NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
webapp authors:7001 [default] 100.00% 2.80 25ms 48ms 50ms 0.00% 0.00%
└─────────────────────► authors:7001 100.00% 2.80 16ms 45ms 49ms 0.00%
webapp books:7002 books-list HTTPRoute 100.00% 1.43 25ms 48ms 50ms 0.00% 0.00%
└─────────────────────► books:7002 100.00% 1.43 12ms 24ms 25ms 0.00%
webapp books:7002 books-create HTTPRoute 54.27% 2.73 27ms 207ms 441ms 0.00% 0.00%
└─────────────────────► books:7002 54.27% 2.73 14ms 152ms 230ms 0.00%
webapp books:7002 books-delete HTTPRoute 100.00% 0.72 25ms 48ms 50ms 0.00% 0.00%
└─────────────────────► books:7002 100.00% 0.72 12ms 24ms 25ms 0.00%
```

This tells us that it is requests to the `books-create` HTTPRoute which have
@@ -287,37 +260,54 @@ kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
retry.linkerd.io/http=5xx
```

We can then see the effect of these retries by looking at Linkerd's retry
metrics:
We can then see the effect of these retries:

```bash
linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
| grep outbound_http_route_backend_response_statuses_total \
| grep retry
```

```text
outbound_http_route_retry_limit_exceeded_total{...} 222
outbound_http_route_retry_overflow_total{...} 0
outbound_http_route_retry_requests_total{...} 469
outbound_http_route_retry_successes_total{...} 247
```

```bash
linkerd viz -n booksapp stat-outbound deploy/webapp
NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
webapp books:7002 books-create HTTPRoute 73.17% 2.05 98ms 460ms 492ms 0.00% 34.22%
└─────────────────────► books:7002 48.13% 3.12 29ms 93ms 99ms 0.00%
webapp books:7002 books-list HTTPRoute 100.00% 1.50 25ms 48ms 49ms 0.00% 0.00%
└─────────────────────► books:7002 100.00% 1.50 12ms 24ms 25ms 0.00%
webapp books:7002 books-delete HTTPRoute 100.00% 0.73 25ms 48ms 50ms 0.00% 0.00%
└─────────────────────► books:7002 100.00% 0.73 12ms 24ms 25ms 0.00%
webapp authors:7001 [default] 100.00% 2.98 25ms 48ms 50ms 0.00% 0.00%
└─────────────────────► authors:7001 100.00% 2.98 16ms 44ms 49ms 0.00%
```

This tells us that Linkerd made a total of 469 retry requests, of which 247 were
successful. The remaining 222 failed and could not be retried again, since we
didn't raise the retry limit from its default of 1.
Notice that while individual requests to the books backend on the `books-create`
route succeed only about 50% of the time, the overall success rate on that route
has been raised to 73% by retries. We can also see that 34.22% of the requests
on this route are retries and that the improved success rate has come at the
expense of additional RPS to the backend and increased overall latency.
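
As a rough sanity check (assuming attempt failures are independent), a
per-attempt success rate of about 48% with a single retry gives roughly
1 - (1 - 0.48)^2 ≈ 0.73 per request, which lines up with the 73.17%
route-level figure above.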

We can improve this further by increasing this limit to allow more than 1 retry
By default, Linkerd will only attempt 1 retry per failure. We can improve the
success rate further by increasing this limit to allow more than 1 retry
per request:

```bash
kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
retry.linkerd.io/limit=3
```

Over time you will see `outbound_http_route_retry_requests_total` and
`outbound_http_route_retry_successes_total` increase at a much higher rate than
`outbound_http_route_retry_limit_exceeded_total`.
Looking at the stats again:

```bash
linkerd viz -n booksapp stat-outbound deploy/webapp
NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
webapp books:7002 books-delete HTTPRoute 100.00% 0.75 25ms 48ms 50ms 0.00% 0.00%
└─────────────────────► books:7002 100.00% 0.75 12ms 24ms 25ms 0.00%
webapp authors:7001 [default] 100.00% 2.92 25ms 48ms 50ms 0.00% 0.00%
└─────────────────────► authors:7001 100.00% 2.92 18ms 46ms 49ms 0.00%
webapp books:7002 books-create HTTPRoute 92.78% 1.62 111ms 461ms 492ms 0.00% 47.28%
└─────────────────────► books:7002 48.91% 3.07 42ms 179ms 236ms 0.00%
webapp books:7002 books-list HTTPRoute 100.00% 1.45 25ms 48ms 50ms 0.00% 0.00%
└─────────────────────► books:7002 100.00% 1.45 12ms 24ms 25ms 0.00%
```

We see that these additional retries have increased the overall success rate on
this route to 92.78%.
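
Under the same rough independence assumption, up to three retries (four
attempts in total) at a per-attempt success rate of about 49% gives roughly
1 - (1 - 0.49)^4 ≈ 0.93, consistent with the rate we observe.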

## Timeouts

@@ -337,30 +327,19 @@ getting so many that it's hard to see what's going on!)
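
(The timeout itself is applied in the step collapsed above. As a hedged sketch
of the general shape of that step, it annotates the HTTPRoute much like the
retry annotations; the 25ms value below is an assumed example, not the guide's
exact setting.)

```bash
# Illustrative only: the actual timeout value is set in the collapsed step
# above; 25ms here is an assumed example value.
kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
  timeout.linkerd.io/request=25ms
```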
We can see the effects of this timeout by running:

```bash
linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
| grep outbound_http_route_request_statuses_total | grep books-create
linkerd viz -n booksapp stat-outbound deploy/webapp
NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
webapp authors:7001 [default] 100.00% 2.85 26ms 49ms 370ms 0.00% 0.00%
└─────────────────────► authors:7001 100.00% 2.85 19ms 49ms 86ms 0.00%
webapp books:7002 books-create HTTPRoute 78.90% 1.82 45ms 449ms 490ms 21.10% 47.34%
└─────────────────────► books:7002 41.55% 3.45 24ms 134ms 227ms 11.11%
webapp books:7002 books-list HTTPRoute 100.00% 1.40 25ms 47ms 49ms 0.00% 0.00%
└─────────────────────► books:7002 100.00% 1.40 12ms 24ms 25ms 0.00%
webapp books:7002 books-delete HTTPRoute 100.00% 0.70 25ms 48ms 50ms 0.00% 0.00%
└─────────────────────► books:7002 100.00% 0.70 12ms 24ms 25ms 0.00%
```

```text
outbound_http_route_request_statuses_total{
[...]
route_name="books-create",
http_status="",
error="REQUEST_TIMEOUT"
} 151
outbound_http_route_request_statuses_total{
[...]
route_name="books-create",
http_status="201",
error=""
} 5548
outbound_http_route_request_statuses_total{
[...]
route_name="books-create",
http_status="500",
error=""
} 3194
```
We see that 21.10% of the requests are hitting this timeout.

## Clean Up

40 changes: 19 additions & 21 deletions linkerd.io/content/2-edge/tasks/fault-injection.md
@@ -53,17 +53,20 @@ After a little while, the stats will show 100% success rate. You can verify this
by running:

```bash
linkerd viz -n booksapp stat deploy
linkerd viz -n booksapp stat-inbound deploy
```

The output will end up looking a little like:

```bash
NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN
authors 1/1 100.00% 7.1rps 4ms 26ms 33ms 6
books 1/1 100.00% 8.6rps 6ms 73ms 95ms 6
traffic 1/1 - - - - - -
webapp 3/3 100.00% 7.9rps 20ms 76ms 95ms 9
NAME SERVER ROUTE TYPE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
authors [default]:4191 [default] 100.00% 0.20 0ms 1ms 1ms
authors [default]:7001 [default] 100.00% 3.00 2ms 36ms 43ms
books [default]:4191 [default] 100.00% 0.23 4ms 4ms 4ms
books [default]:7002 [default] 100.00% 3.60 2ms 2ms 2ms
traffic [default]:4191 [default] 100.00% 0.22 0ms 3ms 1ms
webapp [default]:4191 [default] 100.00% 0.72 4ms 5ms 1ms
webapp [default]:7000 [default] 100.00% 3.25 2ms 2ms 65ms
```

## Create the faulty backend
@@ -182,25 +185,20 @@ for details.

When Linkerd sees traffic going to the `books` service, it will send 9/10
requests to the original service and 1/10 to the error injector. You can see
what this looks like by running `stat` and filtering explicitly to just the
requests from `webapp`:
what this looks like by running `stat-outbound`:

```bash
linkerd viz stat -n booksapp deploy --from deploy/webapp
NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN
authors 1/1 98.15% 4.5rps 3ms 36ms 39ms 3
books 1/1 100.00% 6.7rps 5ms 27ms 67ms 6
error-injector 1/1 0.00% 0.7rps 1ms 1ms 1ms 3
linkerd viz stat-outbound -n booksapp deploy/webapp
NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
webapp authors:7001 [default] 98.44% 4.28 25ms 47ms 50ms 0.00% 0.00%
└────────────────────► authors:7001 98.44% 4.28 15ms 42ms 48ms 0.00%
webapp books:7002 error-split HTTPRoute 87.76% 7.22 26ms 49ms 333ms 0.00% 0.00%
├────────────────────► books:7002 100.00% 6.33 14ms 42ms 83ms 0.00%
└────────────────────► error-injector:8080 0.00% 0.88 12ms 24ms 25ms 0.00%
```

We can also look at the success rate of the `webapp` overall to see the effects
of the error injector. The success rate should be approximately 90%:

```bash
linkerd viz stat -n booksapp deploy/webapp
NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN
webapp 3/3 88.42% 9.5rps 14ms 37ms 75ms 10
```
We can see here that 0.88 requests per second are being sent to the error
injector and that the overall success rate is 87.76%.
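
(For reference, the 9-to-1 split shown above comes from an HTTPRoute that
weights two backends. A minimal, hedged sketch follows; the names, ports, and
weights are inferred from the output above, and the guide's actual manifest,
collapsed in this diff, is the authoritative version.)

```bash
# Rough sketch of a weighted backend split; the guide's real error-split
# manifest is authoritative.
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: error-split
  namespace: booksapp
spec:
  parentRefs:
    - name: books
      namespace: booksapp
      kind: Service
      group: core
      port: 7002
  rules:
    - backendRefs:
        - name: books
          port: 7002
          weight: 9
        - name: error-injector
          port: 8080
          weight: 1
EOF
```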

## Cleanup

4 changes: 2 additions & 2 deletions linkerd.io/content/2-edge/tasks/multicluster.md
@@ -383,10 +383,10 @@ You'll see the `greeting from east` message! Requests from the `frontend` pod
running in `west` are being transparently forwarded to `east`. Assuming that
you're still port forwarding from the previous step, you can also reach this
with `curl http://localhost:8080/east`. Make that call a couple times and
you'll be able to get metrics from `linkerd viz stat` as well.
you'll be able to get metrics from `linkerd viz stat-outbound` as well.

```bash
linkerd --context=west -n test viz stat --from deploy/frontend svc
linkerd --context=west -n test viz stat-outbound deploy/frontend
```

We also provide a grafana dashboard to get a feel for what's going on here (see
