What is the bug?
Since updating to Mimir 2.14.2 we have seen Mimir begin to crash regularly. Here is a stack trace of the most recent crash from the logs, including a few log lines just before the event in case the query mentioned in them is the trigger.
It appears frontendv1pb.(*frontendProcessClient).Send might be handing invalid data to the protobuf marshaler (the panic below is in MarshalToSizedBuffer, on the encode path rather than unmarshal). The -1 appears to be a meta result for a count of records, and it is used without being checked.
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: ts=2024-12-10T10:45:15.837384293Z caller=spanlogger.go:111 query="avg by (instance) (quantile_over_time(0.95, (1 - rate(node_cpu_seconds_total{instance=\"nsiqlookup04.am2.nskope.net:basic\",mode=\"idle\"}[5m]))[1w:1m]))" query_timestamp=1733827478857 user=anonymous caller=log.go:168 level=error user=anonymous msg="error processing request" try=0 err="expanding series: rpc error: code = Canceled desc = grpc: the client connection is closing"
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: ts=2024-12-10T10:45:15.837737062Z caller=spanlogger.go:111 method=storeGatewayStreamReader.StartBuffering user=anonymous caller=log.go:168 level=error msg="received error while streaming chunks from store-gateway" err="aborted stream because query was cancelled: context canceled"
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: ts=2024-12-10T10:45:15.8381008Z caller=spanlogger.go:111 query="avg by (instance) (quantile_over_time(0.95, (1 - rate(node_cpu_seconds_total{instance=\"nsiqlookup04.am2.nskope.net:basic\",mode=\"idle\"}[5m]))[1w:1m]))" query_timestamp=1733827478857 user=anonymous caller=log.go:168 level=error user=anonymous msg="error processing request" try=0 err="expanding series: rpc error: code = Canceled desc = grpc: the client connection is closing"
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: ts=2024-12-10T10:45:15.840156327Z caller=spanlogger.go:111 query="avg by (instance) (quantile_over_time(0.95, (1 - rate(node_cpu_seconds_total{instance=\"nsiqlookup04.am2.nskope.net:basic\",mode=\"idle\"}[5m]))[1w:1m]))" query_timestamp=1733827478857 user=anonymous caller=log.go:168 level=error user=anonymous msg="error processing request" try=0 err="expanding series: rpc error: code = Canceled desc = grpc: the client connection is closing"
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: panic: runtime error: index out of range [-1]
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: goroutine 330564995 [running]:
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb.(*ClientToFrontend).MarshalToSizedBuffer(0x114b5957720, {0x13ea096a2a0, 0xd9, 0xd9})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb/frontend.pb.go:745 +0x267
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb.(*ClientToFrontend).Marshal(0x114b5957720)
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb/frontend.pb.go:699 +0x4d
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/protobuf/internal/impl.legacyMarshal({{}, {0x34788c8, 0x133792a9130}, {0x0, 0x0, 0x0}, 0x0})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/vendor/google.golang.org/protobuf/internal/impl/legacy_message.go:411 +0xf2
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/protobuf/proto.MarshalOptions.marshal({{}, 0xe0?, 0x8?, 0x70?}, {0x0, 0x0, 0x0}, {0x34788c8, 0x133792a9130})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/vendor/google.golang.org/protobuf/proto/encode.go:194 +0x35d
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/protobuf/proto.Marshal({0x3430a20, 0x133792a9130})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/vendor/google.golang.org/protobuf/proto/encode.go:110 +0x50
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/grpc/encoding/proto.codec.Marshal({}, {0x2c29040?, 0x114b5957720?})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/vendor/google.golang.org/grpc/encoding/proto/proto.go:47 +0x52
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/grpc.encode({0x7fb01ffde008?, 0x4c16fe0?}, {0x2c29040?, 0x114b5957720?})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/vendor/google.golang.org/grpc/rpc_util.go:647 +0x3e
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/grpc.prepareMsg({0x2c29040?, 0x114b5957720?}, {0x7fb01ffde008?, 0x4c16fe0?}, {0x0, 0x0}, {0x0, 0x0})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/vendor/google.golang.org/grpc/stream.go:1772 +0xc7
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/grpc.(*clientStream).SendMsg(0x1263479a900, {0x2c29040, 0x114b5957720})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/vendor/google.golang.org/grpc/stream.go:894 +0xd7
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb.(*frontendProcessClient).Send(0x17b5f9c0de0?, 0x133710fa3f0?)
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb/frontend.pb.go:533 +0x25
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: github.com/grafana/mimir/pkg/querier/worker.(*frontendProcessor).process.func1(0x122f1937400, 0x13cc9c46e10)
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/pkg/querier/worker/frontend_processor.go:125 +0xcd
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: github.com/grafana/mimir/pkg/querier/worker.(*frontendProcessor).runRequest(0xc0043ad940, {0x344db78?, 0x126412e02d0?}, 0x114ed589630, 0x1, 0x14976, 0xfc631ae040)
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/pkg/querier/worker/frontend_processor.go:182 +0x47a
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: created by github.com/grafana/mimir/pkg/querier/worker.(*frontendProcessor).process in goroutine 507
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: #011/go/src/github.com/grafana/mimir/pkg/querier/worker/frontend_processor.go:122 +0x2db
Dec 10 10:46:11 mon-prom-lt-storage01 systemd[1]: ns-monitoring-store-mimir.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 10 10:46:11 mon-prom-lt-storage01 systemd[1]: ns-monitoring-store-mimir.service: Failed with result 'exit-code'.
A quick look suggests the issue is the use of i on line 745 without checking its validity after it was modified by the preceding lines: https://github.com/grafana/mimir/blob/main/pkg/frontend/v1/frontendv1pb/frontend.pb.go#L741-L745
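To make the failure mode concrete, here is a self-contained sketch of the right-to-left fill pattern that gogo/protobuf generates (the type and helper names below are invented for illustration; this is not the actual frontend.pb.go code). If the message grows between Size() and MarshalToSizedBuffer, the index i walks off the front of the buffer and fails with exactly the index out of range [-1] panic seen above:

```go
package main

import "fmt"

// sov returns the number of bytes needed to varint-encode x,
// matching the sov* helper in gogo/protobuf generated code.
func sov(x uint64) (n int) {
	for {
		n++
		x >>= 7
		if x == 0 {
			return n
		}
	}
}

// encodeVarint writes a varint ending just before offset and returns the
// new (lower) offset, like the generated encodeVarintFrontend helper.
func encodeVarint(dAtA []byte, offset int, v uint64) int {
	offset -= sov(v)
	base := offset
	for v >= 1<<7 {
		dAtA[offset] = uint8(v&0x7f | 0x80)
		v >>= 7
		offset++
	}
	dAtA[offset] = uint8(v)
	return base
}

// msg stands in for a generated message with a single bytes field.
type msg struct{ payload []byte }

// size mirrors the generated Size(): one tag byte, a varint length, the bytes.
func (m *msg) size() int { return 1 + sov(uint64(len(m.payload))) + len(m.payload) }

// marshalToSizedBuffer mirrors the generated right-to-left fill.
func (m *msg) marshalToSizedBuffer(dAtA []byte) (int, error) {
	i := len(dAtA)
	if len(m.payload) > 0 {
		i -= len(m.payload)
		copy(dAtA[i:], m.payload)
		i = encodeVarint(dAtA, i, uint64(len(m.payload)))
		i--          // no room left for the tag byte if the payload grew...
		dAtA[i] = 0xa // ...so i is -1 here: "index out of range [-1]"
	}
	return len(dAtA) - i, nil
}

func main() {
	m := &msg{payload: []byte("ok")}
	buf := make([]byte, m.size())      // buffer sized for the *current* payload
	m.payload = append(m.payload, '!') // a concurrent writer grows the message
	_, _ = m.marshalToSizedBuffer(buf) // panics exactly like the trace above
	fmt.Println("unreachable")
}
```

Note that a bounds check on i at line 745 would only mask the symptom: the generated code assumes the message is not mutated between sizing and marshaling, so the real fix is finding whatever concurrently modifies the request.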
How to reproduce it?
No good exact steps to reproduce, but hopefully the stack trace will help pinpoint the code, which can then be updated to guard against the problem. Roughly:
1. Start Mimir 2.14.2.
2. Wait about 36 hours under heavy ingestion.
What did you think would happen?
Bad data shouldn't crash Mimir.
What was your environment?
Bare metal: 128 CPUs, 2 TB RAM, 35 TB storage. 4-node cluster: 2 nodes targeting 'all' (including the one that crashed) and 2 nodes targeting 'read,backend' to help with queries. replication_factor: 1.
Any additional context to share?
No response
Just noting that this looks remarkably similar to #8140. The -1 comes from the protobuf library code filling a pre-allocated buffer right to left while some other goroutine concurrently modifies the proto payload, so that the encoded representation is larger than it was when the pre-allocated buffer's length was computed (for example, a field length crossing 128 needs two varint bytes instead of one).
Do you by chance have a full goroutine dump, or just the stack from the one goroutine?
@seizethedave, I'm sorry we don't have more. Is there something I can add at startup of Mimir that will output a 'core dump' or something if it crashes?
@tgriffitts-ns You can start the affected Mimir service with the GOTRACEBACK=all environment variable set. That will log all goroutine stacks rather than just the panicking one, which would likely be useful in finding the data race.
If you tell me whether you're using Helm, Jsonnet, etc I can probably give better instructions on setting this env var.
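For what it's worth, the logs above come from a systemd-managed unit (ns-monitoring-store-mimir.service), so on bare metal a drop-in override along these lines should do it (the drop-in file name is arbitrary):

```ini
# /etc/systemd/system/ns-monitoring-store-mimir.service.d/gotraceback.conf
[Service]
Environment=GOTRACEBACK=all
```

Then run systemctl daemon-reload and systemctl restart ns-monitoring-store-mimir.service. Be aware that with this many goroutines (note the goroutine ID in the trace above) the crash dump will be very large, so make sure journald rate limits won't drop it.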