redis.Client#XReadUntilResult(): also re-try timeout errors #23
Conversation
This is not the right way of resolving the underlying issue. I mean, retrying the error might actually work around it, but that's not the proper solution. go-redis is capable of retrying i/o timeouts, but that code path is only effective if you do not explicitly define a read timeout, which is not the case here: https://github.com/redis/go-redis/blob/v9.5.1/stream_commands.go#L163-L165
And our default BlockTimeout happens to be 1s, which effectively disables the go-redis internal retry handling and overrides its read timeout of 30s.
icinga-go-library/redis/client.go, line 58 in 420fbff:
    ReadTimeout: c.Options.Timeout,
icinga-go-library/redis/config.go, line 15 in 420fbff:
    Timeout time.Duration `yaml:"timeout" default:"30s"`
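To make that interaction concrete, here is a minimal Go sketch, assuming go-redis v9; the address, stream name, and values are purely illustrative and this is not code from this repository:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	// Client-wide read timeout: without a per-command timeout, go-redis can
	// retry i/o timeouts internally on its own.
	client := redis.NewClient(&redis.Options{
		Addr:        "localhost:6379",
		ReadTimeout: 30 * time.Second,
	})

	// With Block >= 0, go-redis derives this command's read timeout from Block,
	// so an i/o timeout is handed back to the caller instead of being retried.
	// Block of 1s mirrors the default BlockTimeout discussed above.
	res, err := client.XRead(context.Background(), &redis.XReadArgs{
		Streams: []string{"icinga:history:stream:notification", "0-0"},
		Count:   4096,
		Block:   time.Second,
	}).Result()
	fmt.Println(res, err)
}
```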
As I mentioned above, retrying it might fix the issue, but we should consider dropping the default BlockTimeout and only setting it if explicitly configured by the user.
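A rough sketch of that suggestion, assuming a hypothetical blockTimeout option; the helper and parameter names are made up for illustration and this is not the actual client code:

```go
package redissketch

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// buildXReadArgs only blocks when the user explicitly configured a timeout.
// With a negative Block, go-redis v9 should omit the BLOCK argument entirely
// and keep its internal retry handling for i/o timeouts
// (cf. the stream_commands.go lines linked above).
func buildXReadArgs(streams []string, count int64, blockTimeout time.Duration) *redis.XReadArgs {
	args := &redis.XReadArgs{
		Streams: streams,
		Count:   count,
		Block:   -1, // no BLOCK unless configured below
	}
	if blockTimeout > 0 {
		args.Block = blockTimeout
	}
	return args
}
```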
... so the user could run into ref/IP/54215 depending on their config? Scary IMAO. In contrast, my implementation always works.
No, it is not! If you as a user configure …
That block in the XREAD is set for a reason, see cd96f0d. So if you'd just undo that, that problem would probably come back.
Please don't just reference internal tickets as part of your argument without any kind of summary. It's annoying to have to look up the whole ticket to understand what you're trying to say, and for everyone outside of Icinga it's impossible to follow.
I'm missing the analysis/explanation of why this is supposed to fix the reported error. Have you tried reproducing the issue? So far I don't even understand what's happening here: it's a connection on localhost, so where would a delay of 10 seconds come from?
After building Icinga DB pinned to this branch of Icinga Go Library, the behavior in the following test scenario changes:
killall -STOP redis-server && date && sleep 3m && killall -CONT redis-server && date
Before
Redis was paused from 16:44:59Z to 16:47:59Z:
2024-08-09T15:45:10.262Z FATAL icingadb read tcp 172.18.0.21:59398->172.18.0.5:6379: i/o timeout
can't perform "[xread count 4096 block 1000 streams icinga:history:stream:notification 1723218296379-1]"
github.com/icinga/icinga-go-library/redis.WrapCmdErr
/go/pkg/mod/github.com/icinga/[email protected]/redis/utils.go:15
github.com/icinga/icinga-go-library/redis.(*Client).XReadUntilResult
/go/pkg/mod/github.com/icinga/[email protected]/redis/client.go:222
github.com/icinga/icingadb/pkg/icingadb/history.Sync.readFromRedis
/icingadb-src/pkg/icingadb/history/sync.go:112
github.com/icinga/icingadb/pkg/icingadb/history.Sync.Sync.func1
/icingadb-src/pkg/icingadb/history/sync.go:81
golang.org/x/sync/errgroup.(*Group).Go.func1
/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1695
can't read history
github.com/icinga/icingadb/pkg/icingadb/history.Sync.readFromRedis
/icingadb-src/pkg/icingadb/history/sync.go:114
github.com/icinga/icingadb/pkg/icingadb/history.Sync.Sync.func1
/icingadb-src/pkg/icingadb/history/sync.go:81
golang.org/x/sync/errgroup.(*Group).Go.func1
/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1695
After
Redis was paused from 16:59:21Z to 17:02:21Z:
2024-08-09T16:00:20.141Z WARN heartbeat Lost Icinga heartbeat {"timeout": "1m0s"}
2024-08-09T16:00:20.141Z ERROR high-availability Lost heartbeat
2024-08-09T16:01:20.141Z WARN heartbeat Waiting for Icinga heartbeat
2024-08-09T16:02:20.141Z WARN heartbeat Waiting for Icinga heartbeat
2024-08-09T16:02:21.735Z INFO heartbeat Received Icinga heartbeat {"environment": "6cc4013ece4ffdbb275a4a763460507d17203146"}
2024-08-09T16:02:21.736Z ERROR high-availability Received heartbeat from the past {"time": "2024-08-09T15:59:24.555Z"}
2024-08-09T16:02:21.739Z ERROR high-availability Received heartbeat from the past {"time": "2024-08-09T15:59:25.571Z"}
2024-08-09T16:02:23.170Z INFO high-availability Another instance is active {"instance_id": "fe568d4d2a074dea822696dcd79801df", "environment": "6cc4013ece4ffdbb275a4a763460507d17203146", "heartbeat": "2024-08-09T16:02:23.030Z", "heartbeat_age": "140.936828ms"}
It now survives. The logging could be better though, but that probably has to be handled on the Icinga DB side.
Tested with: docker kill -s STOP redis && date && sleep 180 && docker kill -s CONT redis && date
Before:
$ docker kill -s STOP redis && date && sleep 180 && docker kill -s CONT redis && date
redis
Mo 16 Sep 2024 15:24:02 CEST
--------
2024-09-16T15:24:13.040+0200 FATAL main read tcp 127.0.0.1:51709->127.0.0.1:6380: i/o timeout
can't perform "[xread block 1000 streams icinga:schema 1726492901357-0]"
github.com/icinga/icinga-go-library/redis.WrapCmdErr
/Users/yhabteab/Workspace/go/icinga-go-library/redis/utils.go:15
github.com/icinga/icinga-go-library/redis.(*Client).XReadUntilResult
/Users/yhabteab/Workspace/go/icinga-go-library/redis/client.go:223
main.checkRedisSchema
/Users/yhabteab/Workspace/go/icingadb/cmd/icingadb/main.go:392
main.monitorRedisSchema
/Users/yhabteab/Workspace/go/icingadb/cmd/icingadb/main.go:374
runtime.goexit
/opt/homebrew/opt/go/libexec/src/runtime/asm_arm64.s:1222
can't read Redis schema version
main.checkRedisSchema
/Users/yhabteab/Workspace/go/icingadb/cmd/icingadb/main.go:396
main.monitorRedisSchema
/Users/yhabteab/Workspace/go/icingadb/cmd/icingadb/main.go:374
runtime.goexit
/opt/homebrew/opt/go/libexec/src/runtime/asm_arm64.s:1222
exit status 1
After:
2024-09-16T15:19:39.225+0200 ERROR high-availability Lost heartbeat
2024-09-16T15:19:39.225+0200 WARN main Handing over {"reason": "lost heartbeat"}
2024-09-16T15:20:39.226+0200 WARN heartbeat Waiting for Icinga heartbeat
2024-09-16T15:21:39.227+0200 WARN heartbeat Waiting for Icinga heartbeat
2024-09-16T15:21:41.002+0200 INFO heartbeat Received Icinga heartbeat {"environment": "e9698ca8a2cd336a55a948ea51903878d13869bd"}
2024-09-16T15:21:41.002+0200 ERROR high-availability Received heartbeat from the past {"time": "2024-09-16T15:18:59.371+0200"}
2024-09-16T15:21:41.371+0200 INFO main Taking over {"reason": "no other instance is active"}
2024-09-16T15:21:41.376+0200 INFO main Starting config sync
2024-09-16T15:21:41.376+0200 INFO main Starting initial state sync
2024-09-16T15:21:41.376+0200 INFO main Starting overdue sync
We need to retry the XREAD commands in the following situations:
- If Go Redis returns redis.Nil, it means no data was read from Redis, e.g. when the keys don't exist yet, and we will need to retry the operation.
- To prevent surpassing Go Redis's internal maximum retries or any other I/O timeouts [^1], it's important to set a block timeout greater than zero for the XREAD commands, see the `a.Block` assignment. However, setting a block timeout means that Go Redis will not retry any errors internally and will instead return an I/O timeout error when exceeding the timeout. Thus, we need to handle this here and retry it again. On the other hand, an I/O timeout could also mean a `context.DeadlineExceeded` error, which is not retryable, so we have to check for context termination ourselves via `ctx.Err()`.

[^1]: redis/go-redis#2131
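For orientation, a minimal sketch of the retry behaviour described above, assuming go-redis v9; this is not the actual XReadUntilResult implementation, and the helper names are illustrative:

```go
package redisretry

import (
	"context"
	"errors"
	"net"

	"github.com/redis/go-redis/v9"
)

// xreadUntilResult is an illustrative stand-in for the behaviour described in the
// commit message above, not the library's real code.
func xreadUntilResult(ctx context.Context, c *redis.Client, a *redis.XReadArgs) ([]redis.XStream, error) {
	for {
		streams, err := c.XRead(ctx, a).Result()
		if err == nil {
			return streams, nil
		}

		// redis.Nil: the block timeout expired without new entries, so just retry.
		// A network timeout is also expected here, because a.Block > 0 disables
		// Go Redis's internal retries. Before retrying either case, check whether
		// the timeout actually stems from context termination, which must not be retried.
		if errors.Is(err, redis.Nil) || isNetTimeout(err) {
			if ctx.Err() != nil {
				return nil, ctx.Err()
			}
			continue
		}

		return nil, err
	}
}

// isNetTimeout reports whether err is a network timeout such as "i/o timeout".
func isNetTimeout(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}
```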
Thanks for the adjustments @yhabteab.
We need to retry the XREAD commands in the following situations:
- If Go Redis returns redis.Nil, it means no data was read from Redis, e.g. when the keys don't exist yet, and we will need to retry the operation.
- To prevent surpassing Go Redis's internal maximum retries or any other I/O timeouts [^1], it's important to set a block timeout greater than zero for the XREAD commands, see the `a.Block` assignment. However, setting a block timeout means that Go Redis will not retry any errors internally and will instead return an I/O timeout error when exceeding the timeout. Thus, we need to handle this here and retry it again. On the other hand, an I/O timeout could also mean a `context.DeadlineExceeded` error, which is not retryable, so we have to check for context termination ourselves via `ctx.Err()`.

fixes Icinga/icingadb#786
ref/IP/54215

Footnotes
[^1]: https://github.com/redis/go-redis/issues/2131