Currently Athens doesn't work at all if the SingleFlight store (etcd, Redis, etc.) is down or unavailable for some reason. New instances fail to start, and running instances stop working as well.
Proposal
I'm creating this as a proposal and a placeholder to get some comments before I actually start working on this.
The idea is to have another config to define the fallback mechanism, something like:
SingleFlightType = "redis"
SingleFlightFallbackType = "memory" // default to in-memory, possible values: none + all the values currently supported by SingleFlightType
When fallback is enabled and the primary SingleFlight store is down, Athens will fall back to the mechanism specified in the config. It will keep checking the status of the primary store in the background (retries with backoff) and switch back to it once it's available.
This will only be a single-layer fallback, so if the fallback is also down, Athens still won't work.
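To make the switching behaviour concrete, here is a rough Go sketch of how such a wrapper could look. Everything in it is hypothetical and only for illustration: the `Locker` interface, `FallbackLocker`, `memoryLocker`, `fakeRemote` and the `Probe`/`Watch` methods are assumptions, not Athens' actual SingleFlight API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Locker is a hypothetical stand-in for a SingleFlight backend (not Athens' real API).
type Locker interface {
	Ping() error                          // reports whether the backend is reachable
	Do(key string, fn func() error) error // runs fn under the lock for key
}

// memoryLocker serialises work inside one process only; it uses a single
// mutex for all keys to keep the sketch short.
type memoryLocker struct{ mu sync.Mutex }

func (m *memoryLocker) Ping() error { return nil }
func (m *memoryLocker) Do(_ string, fn func() error) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	return fn()
}

var errDown = errors.New("primary store unavailable")

// fakeRemote simulates a distributed store that can be switched off.
type fakeRemote struct{ down atomic.Bool }

func (r *fakeRemote) Ping() error {
	if r.down.Load() {
		return errDown
	}
	return nil
}
func (r *fakeRemote) Do(_ string, fn func() error) error {
	if err := r.Ping(); err != nil {
		return err
	}
	return fn()
}

// FallbackLocker routes calls to the primary while it is healthy and to the
// fallback otherwise. A nil fallback models the "none" setting.
type FallbackLocker struct {
	primary, fallback Locker
	usePrimary        atomic.Bool
}

func NewFallbackLocker(primary, fallback Locker) *FallbackLocker {
	f := &FallbackLocker{primary: primary, fallback: fallback}
	f.Probe()
	return f
}

// Probe pings the primary once and records whether it should be used.
func (f *FallbackLocker) Probe() bool {
	ok := f.primary.Ping() == nil
	f.usePrimary.Store(ok)
	return ok
}

func (f *FallbackLocker) UsingPrimary() bool { return f.usePrimary.Load() }

func (f *FallbackLocker) Do(key string, fn func() error) error {
	if f.UsingPrimary() || f.fallback == nil {
		return f.primary.Do(key, fn)
	}
	return f.fallback.Do(key, fn)
}

// Watch probes the primary until stop is closed. The wait doubles after each
// failed probe (capped at maxWait) and resets to base after a success.
func (f *FallbackLocker) Watch(stop <-chan struct{}, base, maxWait time.Duration) {
	wait := base
	for {
		select {
		case <-stop:
			return
		case <-time.After(wait):
		}
		if f.Probe() {
			wait = base
		} else if wait *= 2; wait > maxWait {
			wait = maxWait
		}
	}
}

func main() {
	remote := &fakeRemote{}
	remote.down.Store(true) // primary is down at startup
	f := NewFallbackLocker(remote, &memoryLocker{})
	err := f.Do("example.com/mod@v1.0.0", func() error { return nil })
	fmt.Println("using primary:", f.UsingPrimary(), "err:", err)
	remote.down.Store(false) // primary recovers
	f.Probe()                // Watch would do this in the background
	fmt.Println("using primary:", f.UsingPrimary())
}
```

In a real deployment `Watch` would run in its own goroutine. Note that this sketch only switches backends on a probe, so a call that hits the primary just as it fails still returns an error until the next probe runs.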
Also, this is mostly a nice-to-have feature to improve the availability and resilience of the system. Our current fallback plan is to change the config and redeploy, which is good enough for the general use case.
I am not sure this will work. Consider that you have 3 nodes behind an LB. If Redis fails and you then fall back to in-memory singleflight, you may still get 3 singleflight requests overwriting each other. I am not sure there is a good way to guarantee uniqueness.
The easiest way is probably to write a new singleflight type backed by a service managed by a cloud provider, just like storage.
> Consider that you have 3 nodes behind an LB. If Redis fails and you then fall back to in-memory singleflight, you may still get 3 singleflight requests overwriting each other.
Yes, if the fallback is configured to use in-memory, then that's the expected behaviour after this implementation. The idea is to let Athens keep working even when Redis is down; currently it doesn't work at all.
> you may still get 3 singleflight requests overwriting each other. I am not sure there is a good way to guarantee uniqueness.
Also, as I described above, another distributed store (Redis, etcd, etc.) can be used as the fallback, so uniqueness can still be guaranteed. It just depends on your config.
Basically, there are 3 possible scenarios:
1. Fallback configured to "none": don't use any fallback and return an error, as in the current behavior.
   Mainly for backwards compatibility, or if you'd like to keep things simple and are okay with handling these situations manually.
2. Fallback configured to "memory": fall back to memory until the distributed store is up. Athens will continue working, but distributed singleflight won't work.
   Use this if you want Athens to continue working without distributed locking.
3. Fallback configured to "etcd/redis/etc.": fall back to another distributed store until the primary one is up. Athens will continue working, and distributed singleflight will also work.
   Use this if you want Athens to continue working with distributed locking and are okay with maintaining a secondary distributed store.
So users will be able to configure Athens according to their resiliency requirements.
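The three scenarios could be resolved from the config value roughly as follows. This is only a sketch: `ResolveFallback` and `Behavior` are hypothetical names, and the accepted distributed types are limited to the ones mentioned in this thread (etcd, redis). The real list would be whatever SingleFlightType currently supports.

```go
package main

import "fmt"

// Behavior describes what Athens would do while the primary store is down.
// It is a hypothetical type used only for this illustration.
type Behavior struct {
	Type         string // resolved fallback type
	KeepsServing bool   // Athens keeps handling requests
	Distributed  bool   // singleflight still holds across instances
}

// ResolveFallback maps a SingleFlightFallbackType value to its behavior.
// An empty value defaults to "memory", as suggested in the proposal.
func ResolveFallback(fallbackType string) (Behavior, error) {
	switch fallbackType {
	case "none":
		// Same as today: no fallback, errors are returned to the caller.
		return Behavior{Type: "none"}, nil
	case "", "memory":
		// Keeps working, but each instance locks independently.
		return Behavior{Type: "memory", KeepsServing: true}, nil
	case "etcd", "redis":
		// Keeps working with distributed locking via a secondary store.
		return Behavior{Type: fallbackType, KeepsServing: true, Distributed: true}, nil
	default:
		return Behavior{}, fmt.Errorf("unsupported SingleFlightFallbackType %q", fallbackType)
	}
}

func main() {
	for _, v := range []string{"none", "", "memory", "etcd", "bogus"} {
		b, err := ResolveFallback(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %+v\n", v, b)
	}
}
```

Rejecting unknown values at startup keeps a typo in the config from silently disabling the fallback.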