Stop stream when stream state inconsistency is detected during flush #6237
base: main
Understand that the current behaviour probably isn't ideal but we need more context on why the filesystem would suddenly become read-only before proceeding here. This seems like a serious environmental misconfiguration if the store directory suddenly changes underneath the server.
@neilalexander We are running NATS in a K8s cluster, where a CSI driver attaches a storage volume to the K8s pod over the network (iSCSI). In our environment the filesystem can go read-only due to various issues. Hence, we want NATS to stop itself cleanly for now when such a situation arises.
@neilalexander When a network outage happens on any of our network-attached storage, we experience this kind of issue.
We could potentially ensure that the server disables JetStream on a write error, but it should do that already IIRC.
@derekcollison I think having the server disable JetStream would also impact streams configured with other stream stores (e.g. memstore). For filesystem issues, only streams configured with filestore are impacted. Also, based on our observation, a stream enters an inconsistent state if the TTL expires all the messages in the stream before the filesystem recovers. Hence, the implementation in this PR tries to limit the corruption handling to only the stream that detects state corruption.
There is only one store directory per server, so that error would have to disable all of JetStream.
Hi @derekcollison, just want to be doubly sure on this. Stopping JetStream will also stop memory-based streams. Is that fine? Implementation-wise, finding only the file-based streams is tricky without iterating over all the streams, which would be inefficient. So it might be okay to stop JetStream.
Yes, disabling JetStream is the right thing to do here. But we should do that when we encounter a disk error.
This PR is intended to handle the stream state reset observed when the filesystem enters read-only mode (for reference: #6211).
This PR introduces the following:
A new server configuration, stream_stop_on_corruption, for JetStream to stop a stream when filestore corruption is observed.
a. When the filesystem enters read-only mode, the filestore detects stream state corruption and resets the in-memory state of the stream to zero. Future publishes then fail, even if the filesystem recovers, until the NATS server restarts and restores the stream state on startup.
b. We have also seen scenarios where the stream state was reset and publishing resumed without needing a restart. This caused consumers to stop consuming new messages from the stream, since the consumer's stream sequence was way ahead of the stream's sequence.
c. Hence, to avoid these situations, we introduce a new server configuration to stop the stream as soon as stream state corruption is detected while writing the stream state to disk.
a. This change ensures that inconsistent state is not flushed to disk before the stream is stopped.