
redis.Client#XReadUntilResult(): also re-try timeout errors #23

Merged: 1 commit merged into main on Oct 18, 2024

Conversation

@Al2Klimov (Member) commented Jun 27, 2024

We need to retry the XREAD commands in the following situations:

  • If Go Redis returns redis.Nil, no data was read from Redis (e.g. because the keys don’t exist yet) and we need to retry the operation.
  • To avoid exceeding Go Redis's internal maximum retries or any other I/O timeouts 1, it's important to set a block timeout greater than zero for the XREAD commands, see the a.Block assignment. However, setting a block timeout means that Go Redis will not retry any errors internally and will instead return an I/O timeout error once the timeout is exceeded, so we have to catch that error here and retry ourselves. On the other hand, an I/O timeout could also stem from a context.DeadlineExceeded error, which is not retryable, so we have to check for context termination ourselves via ctx.Err() (a minimal retry sketch follows after the footnotes).

fixes Icinga/icingadb#786
ref/IP/54215

Footnotes

  1. https://github.com/redis/go-redis/issues/2131
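
For illustration, here is a minimal Go sketch of the retry loop described above. It is not the actual XReadUntilResult implementation; the function name and the fixed one-second block timeout are assumptions.

```go
package redisutil

import (
	"context"
	"errors"
	"net"
	"time"

	"github.com/redis/go-redis/v9"
)

// xreadUntilResult sketches the retry loop from the description above.
// Hypothetical name and fixed block timeout; for illustration only.
func xreadUntilResult(ctx context.Context, c *redis.Client, a *redis.XReadArgs) ([]redis.XStream, error) {
	// A block timeout > 0 keeps each XREAD short-lived, so go-redis's own
	// read timeout and retry limits are never exceeded (see footnote 1).
	a.Block = time.Second

	for {
		streams, err := c.XRead(ctx, a).Result()
		if err == nil {
			return streams, nil
		}

		// redis.Nil: no data was read, e.g. the stream keys don't exist yet.
		if errors.Is(err, redis.Nil) {
			continue
		}

		// With Block > 0, go-redis surfaces an I/O timeout instead of retrying
		// internally, so retry it here, but only while our context is still
		// alive, because context.DeadlineExceeded also looks like a timeout.
		var netErr net.Error
		if errors.As(err, &netErr) && netErr.Timeout() && ctx.Err() == nil {
			continue
		}

		return nil, err
	}
}
```

The real implementation also wraps command errors; the WrapCmdErr frames in the stack traces further down come from that path.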

@Al2Klimov added the bug (Something isn't working) and ref/IP labels on Jun 27, 2024
@cla-bot added the cla/signed (CLA is signed by all contributors of a PR) label on Jun 27, 2024
@Al2Klimov (Member, Author) commented

CC @julianbrost @tbauriedel

@yhabteab (Member) left a comment

This is not the right way of resolving the underlying issue. I mean, retrying the error might actually work around it, but that's not the proper solution. go-redis is capable of retrying I/O timeouts, but this code path is only effective if you do not explicitly define a read timeout, which is not the case here:
https://github.com/redis/go-redis/blob/v9.5.1/stream_commands.go#L163-L165

And our default BlockTimeout happens to be 1s, which effectively disables the go-redis internal retry handling and overrides its read timeout of 30s.

ReadTimeout: c.Options.Timeout,
Timeout time.Duration `yaml:"timeout" default:"30s"`

As I mentioned above, retrying it might fix the issue, but we should consider dropping the default BlockTimeout and only setting it if explicitly configured by the user.
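
For illustration, a hypothetical sketch of that alternative; the helper name, its parameters, and the assumption that a negative Block keeps go-redis from sending the BLOCK option are not the library's documented API and would need to be verified against go-redis:

```go
package redisutil

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// newXReadArgs only sets a block timeout when the user explicitly configured
// one, so go-redis's regular ReadTimeout and internal retry handling stay in
// effect otherwise. Hypothetical helper; names and semantics are assumptions.
func newXReadArgs(streams []string, count int64, configuredBlock time.Duration) *redis.XReadArgs {
	a := &redis.XReadArgs{
		Streams: streams,
		Count:   count,
		Block:   -1, // assumption: a negative Block means "do not send BLOCK"
	}
	if configuredBlock > 0 {
		a.Block = configuredBlock // explicitly configured by the user
	}
	return a
}
```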

@Al2Klimov (Member, Author) commented

... so the user could run into ref/IP/54215 depending on their config? Scary IMAO. In contrast, my implementation always works.

@yhabteab (Member) commented Aug 6, 2024

> Scary IMAO. In contrast, my implementation always works.

No, it is not! If you as a user configure a 30s timeout, you should not be surprised when Icinga DB gives up after exactly that long, and that is not something Icinga DB should keep retrying.

@julianbrost (Contributor) commented Aug 6, 2024

> As I mentioned above, retrying it might fix the issue, but we should consider dropping the default BlockTimeout and only setting it if explicitly configured by the user.

That block in the XREAD is set for a reason, see cd96f0d. If you just undid that, the problem would probably come back.

> ... so the user could run into ref/IP/54215 depending on their config?

Please don't just reference internal tickets as part of your argument without any kind of summary. It's annoying to have to look up the whole ticket to get what you're trying to say, and for everyone outside of Icinga it is impossible to follow at all.

@julianbrost (Contributor) left a comment

I'm missing the analysis/explanation of why this is supposed to fix the reported error. Have you tried reproducing the issue? So far I don't even understand what's happening here: it's a connection on localhost, so where would a delay of 10 seconds come from?

redis/client.go: inline review comment (resolved, outdated)
@Al2Klimov force-pushed the XReadUntilResult-re-try-timeout branch from d806ac3 to 33a5b7b on August 7, 2024
redis/client.go: inline review comment (resolved, outdated)
@Al2Klimov force-pushed the XReadUntilResult-re-try-timeout branch from 33a5b7b to 96c22d1 on August 8, 2024
julianbrost previously approved these changes Aug 9, 2024

@julianbrost (Contributor) left a comment

After building Icinga DB pinned to this branch of Icinga Go Library, the behavior in the following test scenario changes:

killall -STOP redis-server && date && sleep 3m && killall -CONT redis-server && date

Before

Redis was paused from 16:44:59Z to 16:47:59Z:

2024-08-09T15:45:10.262Z    FATAL   icingadb    read tcp 172.18.0.21:59398->172.18.0.5:6379: i/o timeout
can't perform "[xread count 4096 block 1000 streams icinga:history:stream:notification 1723218296379-1]"
github.com/icinga/icinga-go-library/redis.WrapCmdErr
    /go/pkg/mod/github.com/icinga/[email protected]/redis/utils.go:15
github.com/icinga/icinga-go-library/redis.(*Client).XReadUntilResult
    /go/pkg/mod/github.com/icinga/[email protected]/redis/client.go:222
github.com/icinga/icingadb/pkg/icingadb/history.Sync.readFromRedis
    /icingadb-src/pkg/icingadb/history/sync.go:112
github.com/icinga/icingadb/pkg/icingadb/history.Sync.Sync.func1
    /icingadb-src/pkg/icingadb/history/sync.go:81
golang.org/x/sync/errgroup.(*Group).Go.func1
    /go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1695
can't read history
github.com/icinga/icingadb/pkg/icingadb/history.Sync.readFromRedis
    /icingadb-src/pkg/icingadb/history/sync.go:114
github.com/icinga/icingadb/pkg/icingadb/history.Sync.Sync.func1
    /icingadb-src/pkg/icingadb/history/sync.go:81
golang.org/x/sync/errgroup.(*Group).Go.func1
    /go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1695

After

Redis was paused from 16:59:21Z to 17:02:21Z:

2024-08-09T16:00:20.141Z    WARN    heartbeat   Lost Icinga heartbeat   {"timeout": "1m0s"}
2024-08-09T16:00:20.141Z    ERROR   high-availability   Lost heartbeat  
2024-08-09T16:01:20.141Z    WARN    heartbeat   Waiting for Icinga heartbeat
2024-08-09T16:02:20.141Z    WARN    heartbeat   Waiting for Icinga heartbeat
2024-08-09T16:02:21.735Z    INFO    heartbeat   Received Icinga heartbeat   {"environment": "6cc4013ece4ffdbb275a4a763460507d17203146"}
2024-08-09T16:02:21.736Z    ERROR   high-availability   Received heartbeat from the past    {"time": "2024-08-09T15:59:24.555Z"}
2024-08-09T16:02:21.739Z    ERROR   high-availability   Received heartbeat from the past    {"time": "2024-08-09T15:59:25.571Z"}
2024-08-09T16:02:23.170Z    INFO    high-availability   Another instance is active  {"instance_id": "fe568d4d2a074dea822696dcd79801df", "environment": "6cc4013ece4ffdbb275a4a763460507d17203146", "heartbeat": "2024-08-09T16:02:23.030Z", "heartbeat_age": "140.936828ms"} 

It now survives. The logging could be better, but that probably has to be handled on the Icinga DB side.

redis/client.go: inline review comment (resolved)
yhabteab previously approved these changes Sep 16, 2024

@yhabteab (Member) left a comment
Tested with: docker kill -s STOP redis && date && sleep 180 && docker kill -s CONT redis && date

Before:

$ docker kill -s STOP redis && date && sleep 180 && docker kill -s CONT redis && date
redis
Mo 16 Sep 2024 15:24:02 CEST

--------
2024-09-16T15:24:13.040+0200    FATAL   main    read tcp 127.0.0.1:51709->127.0.0.1:6380: i/o timeout
can't perform "[xread block 1000 streams icinga:schema 1726492901357-0]"
github.com/icinga/icinga-go-library/redis.WrapCmdErr
        /Users/yhabteab/Workspace/go/icinga-go-library/redis/utils.go:15
github.com/icinga/icinga-go-library/redis.(*Client).XReadUntilResult
        /Users/yhabteab/Workspace/go/icinga-go-library/redis/client.go:223
main.checkRedisSchema
        /Users/yhabteab/Workspace/go/icingadb/cmd/icingadb/main.go:392
main.monitorRedisSchema
        /Users/yhabteab/Workspace/go/icingadb/cmd/icingadb/main.go:374
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_arm64.s:1222
can't read Redis schema version
main.checkRedisSchema
        /Users/yhabteab/Workspace/go/icingadb/cmd/icingadb/main.go:396
main.monitorRedisSchema
        /Users/yhabteab/Workspace/go/icingadb/cmd/icingadb/main.go:374
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_arm64.s:1222
exit status 1

After:

2024-09-16T15:19:39.225+0200    ERROR   high-availability       Lost heartbeat
2024-09-16T15:19:39.225+0200    WARN    main    Handing over    {"reason": "lost heartbeat"}
2024-09-16T15:20:39.226+0200    WARN    heartbeat       Waiting for Icinga heartbeat
2024-09-16T15:21:39.227+0200    WARN    heartbeat       Waiting for Icinga heartbeat
2024-09-16T15:21:41.002+0200    INFO    heartbeat       Received Icinga heartbeat       {"environment": "e9698ca8a2cd336a55a948ea51903878d13869bd"}
2024-09-16T15:21:41.002+0200    ERROR   high-availability       Received heartbeat from the past        {"time": "2024-09-16T15:18:59.371+0200"}
2024-09-16T15:21:41.371+0200    INFO    main    Taking over     {"reason": "no other instance is active"}
2024-09-16T15:21:41.376+0200    INFO    main    Starting config sync
2024-09-16T15:21:41.376+0200    INFO    main    Starting initial state sync
2024-09-16T15:21:41.376+0200    INFO    main    Starting overdue sync

redis/client.go: inline review comment (resolved, outdated)
We need to retry the XREAD commands in the following situations:

- If Go Redis returns redis.Nil, it means no data was read from Redis — e.g. when
  the keys don’t exist yet, and we will need to retry the operation again.
- To prevent surpassing Go Redis's internal maximum retries or any other I/O timeouts [^1],
  it's important to set a block timeout greater than zero for the XREAD commands, see the
  `a.Block` assignment. However, setting a block timeout means that Go Redis will not retry
  any errors internally and will instead return an I/O timeout error when exceeding the timeout.
  Thus, we need to handle this here and retry it again. On the other hand, an I/O timeout could
  also mean a `context.DeadlineExceeded` error, which is not retryable, so we have to check for
  context termination by ourselves via `ctx.Err()`.

[^1]: redis/go-redis#2131
@yhabteab force-pushed the XReadUntilResult-re-try-timeout branch from a41c810 to 966dfd7 on October 18, 2024
@yhabteab requested a review from lippserd on October 18, 2024
@lippserd (Member) left a comment

Thanks for the adjustments @yhabteab.

@yhabteab removed the request for review from julianbrost on October 18, 2024
@yhabteab merged commit 902150a into main on Oct 18, 2024
15 checks passed
@yhabteab deleted the XReadUntilResult-re-try-timeout branch on October 18, 2024

Labels: bug (Something isn't working), cla/signed (CLA is signed by all contributors of a PR), ref/IP

Linked issue: Timeout in Redis XREAD: read tcp [...]: i/o timeout