Stop stream when stream state inconsistency is detected during flush #6237

Open

pranavmehta94

This PR is intended to handle the stream state reset observed when the filesystem enters read-only mode (for reference: #6211).

This PR introduces the following:

  1. A new JetStream server configuration option, stream_stop_on_corruption, that stops a stream when filestore corruption is observed.
    a. When the filesystem enters read-only mode, the filestore detects stream state corruption and resets the stream's in-memory state to zero. Future publishes then fail, even if the filesystem recovers, until the NATS server restarts and restores the stream state on startup.
    b. We have also seen scenarios where the stream state was reset and publishing resumed without a restart. This caused consumers to stop consuming new messages from the stream, since the consumer's stream sequence was far ahead of the stream's sequence.
    c. Hence, to avoid these situations, we introduce a new server configuration option that stops the stream as soon as stream state corruption is detected while writing stream state to disk.
  2. Mark the stream's filestore as corrupt when the server is configured to stop the stream on state corruption.
    a. This ensures that inconsistent state is not flushed to disk before the stream is stopped.

@pranavmehta94 pranavmehta94 marked this pull request as ready for review December 11, 2024 06:10
@pranavmehta94 pranavmehta94 requested a review from a team as a code owner December 11, 2024 06:10
Member

@neilalexander neilalexander left a comment

Understand that the current behaviour probably isn't ideal but we need more context on why the filesystem would suddenly become read-only before proceeding here. This seems like a serious environmental misconfiguration if the store directory suddenly changes underneath the server.

@pranavmehta94
Author

@neilalexander We are running NATS in a K8s cluster, and a CSI driver attaches a storage volume to the K8s pod over the network (iSCSI). We see that the filesystem can go read-only in our environment due to various issues. Hence, we want NATS to stop itself cleanly for now when such a situation arises.

@skyrocknroll

@neilalexander When a network outage happens on any network-attached storage, we experience this kind of issue.

@derekcollison
Member

We could potentially ensure that the server disables jetstream on a write error, but it should do that already IIRC.

@pranavmehta94
Author

@derekcollison I think the server disabling jetstream would also impact streams configured with other stream stores (e.g. memstore). For filesystem issues, only streams configured with filestore are impacted. Also, based on our observation, a stream enters an inconsistent state if the TTL expires all the messages in the stream before the filesystem recovers. Hence, the implementation in this PR tries to limit the corruption handling to the stream that detects state corruption.
Also, when we hit the filesystem read-only issue, we did not observe the server disabling jetstream.

@derekcollison
Member

There is only one store directory per server, so that error would have to disable all of jetstream.

@shiv4289

Hi @derekcollison, just want to be doubly sure on this.

Stopping jetstream will also stop memory-based streams. Is that fine?

Implementation-wise, limiting this to file-based streams gets tricky, since finding them requires iterating over all the streams, which would be inefficient. So it might be okay to stop jetstream.

@derekcollison
Member

Yes, disabling jetstream is the right thing to do here. But we should do that when we encounter a disk error.

5 participants