Bug: panic: runtime error: index out of range [-1] #10204

Open
tgriffitts-ns opened this issue Dec 10, 2024 · 3 comments
tgriffitts-ns commented Dec 10, 2024

What is the bug?

Since updating to Mimir 2.14.2 we have seen Mimir begin to crash regularly. Below is the stack trace of the most recent crash from the logs, including a few log lines from just before the event in case the query mentioned in them is the trigger.

It appears frontendv1pb.(*frontendProcessClient).Send might be handing invalid data to the protobuf marshalling code. The -1 looks like a meta value for a record count that is then used as an index without validation during marshalling.

Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: ts=2024-12-10T10:45:15.837384293Z caller=spanlogger.go:111 query="avg by (instance) (quantile_over_time(0.95, (1 - rate(node_cpu_seconds_total{instance=\"nsiqlookup04.am2.nskope.net:basic\",mode=\"idle\"}[5m]))[1w:1m]))" query_timestamp=1733827478857 user=anonymous caller=log.go:168 level=error user=anonymous msg="error processing request" try=0 err="expanding series: rpc error: code = Canceled desc = grpc: the client connection is closing"
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: ts=2024-12-10T10:45:15.837737062Z caller=spanlogger.go:111 method=storeGatewayStreamReader.StartBuffering user=anonymous caller=log.go:168 level=error msg="received error while streaming chunks from store-gateway" err="aborted stream because query was cancelled: context canceled"
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: ts=2024-12-10T10:45:15.8381008Z caller=spanlogger.go:111 query="avg by (instance) (quantile_over_time(0.95, (1 - rate(node_cpu_seconds_total{instance=\"nsiqlookup04.am2.nskope.net:basic\",mode=\"idle\"}[5m]))[1w:1m]))" query_timestamp=1733827478857 user=anonymous caller=log.go:168 level=error user=anonymous msg="error processing request" try=0 err="expanding series: rpc error: code = Canceled desc = grpc: the client connection is closing"
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: ts=2024-12-10T10:45:15.840156327Z caller=spanlogger.go:111 query="avg by (instance) (quantile_over_time(0.95, (1 - rate(node_cpu_seconds_total{instance=\"nsiqlookup04.am2.nskope.net:basic\",mode=\"idle\"}[5m]))[1w:1m]))" query_timestamp=1733827478857 user=anonymous caller=log.go:168 level=error user=anonymous msg="error processing request" try=0 err="expanding series: rpc error: code = Canceled desc = grpc: the client connection is closing"

Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: panic: runtime error: index out of range [-1]
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: goroutine 330564995 [running]:
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb.(*ClientToFrontend).MarshalToSizedBuffer(0x114b5957720, {0x13ea096a2a0, 0xd9, 0xd9})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb/frontend.pb.go:745 +0x267
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb.(*ClientToFrontend).Marshal(0x114b5957720)
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb/frontend.pb.go:699 +0x4d
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/protobuf/internal/impl.legacyMarshal({{}, {0x34788c8, 0x133792a9130}, {0x0, 0x0, 0x0}, 0x0})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/vendor/google.golang.org/protobuf/internal/impl/legacy_message.go:411 +0xf2
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/protobuf/proto.MarshalOptions.marshal({{}, 0xe0?, 0x8?, 0x70?}, {0x0, 0x0, 0x0}, {0x34788c8, 0x133792a9130})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/vendor/google.golang.org/protobuf/proto/encode.go:194 +0x35d
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/protobuf/proto.Marshal({0x3430a20, 0x133792a9130})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/vendor/google.golang.org/protobuf/proto/encode.go:110 +0x50
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/grpc/encoding/proto.codec.Marshal({}, {0x2c29040?, 0x114b5957720?})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/vendor/google.golang.org/grpc/encoding/proto/proto.go:47 +0x52
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/grpc.encode({0x7fb01ffde008?, 0x4c16fe0?}, {0x2c29040?, 0x114b5957720?})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/vendor/google.golang.org/grpc/rpc_util.go:647 +0x3e
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/grpc.prepareMsg({0x2c29040?, 0x114b5957720?}, {0x7fb01ffde008?, 0x4c16fe0?}, {0x0, 0x0}, {0x0, 0x0})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/vendor/google.golang.org/grpc/stream.go:1772 +0xc7
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: google.golang.org/grpc.(*clientStream).SendMsg(0x1263479a900, {0x2c29040, 0x114b5957720})
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/vendor/google.golang.org/grpc/stream.go:894 +0xd7
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb.(*frontendProcessClient).Send(0x17b5f9c0de0?, 0x133710fa3f0?)
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/pkg/frontend/v1/frontendv1pb/frontend.pb.go:533 +0x25
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: github.com/grafana/mimir/pkg/querier/worker.(*frontendProcessor).process.func1(0x122f1937400, 0x13cc9c46e10)
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/pkg/querier/worker/frontend_processor.go:125 +0xcd
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: github.com/grafana/mimir/pkg/querier/worker.(*frontendProcessor).runRequest(0xc0043ad940, {0x344db78?, 0x126412e02d0?}, 0x114ed589630, 0x1, 0x14976, 0xfc631ae040)
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/pkg/querier/worker/frontend_processor.go:182 +0x47a
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]: created by github.com/grafana/mimir/pkg/querier/worker.(*frontendProcessor).process in goroutine 507
Dec 10 10:45:15 mon-prom-lt-storage01 mimir[2220384]:         /go/src/github.com/grafana/mimir/pkg/querier/worker/frontend_processor.go:122 +0x2db
Dec 10 10:46:11 mon-prom-lt-storage01 systemd[1]: ns-monitoring-store-mimir.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 10 10:46:11 mon-prom-lt-storage01 systemd[1]: ns-monitoring-store-mimir.service: Failed with result 'exit-code'.

A quick look suggests the issue is that i is used on line 745 without checking its validity after being modified by the preceding lines:

https://github.com/grafana/mimir/blob/main/pkg/frontend/v1/frontendv1pb/frontend.pb.go#L741-L745
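
For illustration, here is a minimal, self-contained sketch of that right-to-left buffer-filling pattern. The type, field, and encodeVarint helper below are stand-ins, not the actual generated frontend.pb.go code; it only shows how a message that grows after the buffer is sized drives i negative:

```go
package main

import "fmt"

// exampleMsg stands in for a generated proto message; Payload stands in for a
// nested/bytes field such as the response body carried by ClientToFrontend
// (names here are illustrative only).
type exampleMsg struct {
	Payload []byte
}

// encodeVarint mirrors the shape of the generated encodeVarintXxx helpers:
// it writes the varint encoding of v ending just before offset and returns the
// new offset.
func encodeVarint(dAtA []byte, offset int, v uint64) int {
	for v >= 1<<7 {
		offset--
		dAtA[offset] = byte(v&0x7f | 0x80)
		v >>= 7
	}
	offset--
	dAtA[offset] = byte(v)
	return offset
}

// MarshalToSizedBuffer sketches the right-to-left filling pattern used by
// gogo/protobuf generated code: the buffer is sized up front from Size(), and
// field data is written from the end of the buffer towards index 0. If the
// message grows after sizing, i underflows and the indexed write below panics.
func (m *exampleMsg) MarshalToSizedBuffer(dAtA []byte) (int, error) {
	i := len(dAtA)
	if len(m.Payload) > 0 {
		i -= len(m.Payload)
		copy(dAtA[i:], m.Payload)
		i = encodeVarint(dAtA, i, uint64(len(m.Payload)))
		i--
		dAtA[i] = 0x12 // field tag byte; this is the kind of write that hits dAtA[-1]
	}
	return len(dAtA) - i, nil
}

func main() {
	m := &exampleMsg{Payload: []byte("ok")}
	buf := make([]byte, 4)             // sized for the 2-byte payload: tag + length varint + "ok"
	m.Payload = append(m.Payload, '!') // simulate another goroutine growing the message after sizing
	_, _ = m.MarshalToSizedBuffer(buf) // panic: runtime error: index out of range [-1]
	fmt.Println("not reached")
}
```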

How to reproduce it?

There are no exact steps to reproduce, but hopefully the stack trace will help pinpoint the code that can be updated to guard against the problem.

  1. Start Mimir 2.14.2
  2. Wait about 36 hours with heavy ingestion.

What did you think would happen?

Bad data shouldn't crash Mimir.

What was your environment?

Bare metal: 128 CPUs, 2 TB RAM, 35 TB storage. 4-node cluster: 2 nodes targeting 'all' (including the one that crashed), 2 nodes targeting 'read,backend' to help with queries. replication_factor 1.

Any additional context to share?

No response

seizethedave (Contributor) commented

Just noting that this looks remarkably similar to #8140. The -1 reference comes from the protobuf library code filling a pre-allocated buffer right to left while some other goroutine concurrently modifies the proto payload, so that the varint representation is larger than it was when the pre-allocated buffer's length was computed.
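
To make the varint-size point concrete, here is a standalone sketch (not Mimir code) of the size calculation: a length field that grows from 127 to 128 between sizing and marshalling needs one extra varint byte, which is enough to push the right-to-left write offset below zero.

```go
package main

import "fmt"

// sov mirrors the generated sovXxx size helpers: the number of bytes in the
// varint encoding of x.
func sov(x uint64) (n int) {
	for {
		n++
		x >>= 7
		if x == 0 {
			return n
		}
	}
}

func main() {
	fmt.Println(sov(127)) // 1 byte
	fmt.Println(sov(128)) // 2 bytes: one more than the buffer was sized for
}
```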

Do you by chance have a full goroutine dump, or just the stack from the one goroutine?

tgriffitts-ns (Author) commented

@seizethedave, I am sorry we don't have more. Is there something I can add at startup of Mimir that will output a 'core dump' or similar if it crashes?

seizethedave (Contributor) commented

@tgriffitts-ns You can start the affected Mimir service with the GOTRACEBACK=all environment variable set. This will log all goroutine stacks rather than just the one that panicked, which would likely be useful in finding the data race.

If you tell me whether you're using Helm, Jsonnet, etc., I can probably give better instructions on setting this environment variable.
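
For what it's worth, the log excerpt shows a bare-metal systemd unit (ns-monitoring-store-mimir.service), so one option there would be a drop-in along these lines (assuming that unit name; adjust if the unit lives elsewhere):

```ini
# /etc/systemd/system/ns-monitoring-store-mimir.service.d/gotraceback.conf
[Service]
Environment=GOTRACEBACK=all
```

Followed by a systemctl daemon-reload and a restart of the service.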
