[BUG] Container doesn't start up after ungraceful termination #511

Open
cpockrandt opened this issue Dec 4, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@cpockrandt
Contributor

Describe the bug

OpenSearch 2.11 is running in Kubernetes with 3 pods, a fairly vanilla installation using the latest Helm chart (2.17.0). The pods terminate ungracefully (e.g., through a blackout of the cluster, or a bug in node eviction that does not respect the graceful termination period).

After the pods come back up, one of the pods starts outputting errors:

{"type": "server", "timestamp": "2023-11-24T14:52:02,526Z", "level": "ERROR", "component": "o.o.b.OpenSearchUncaughtExceptionHandler", "cluster.name": "os", "node.name": "os-mngr-1", "message": "uncaught exception in thread [main]", 
"stacktrace": ["org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/opensearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",
...

It seems there are lock files on the PVC that prevent the container from starting:

/usr/share/opensearch/data/nodes/0/node.lock
/usr/share/opensearch/data/nodes/0/_state/write.lock
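
To check for these files from inside the pod (or from a debug container with the PVC mounted), something like the following sketch could be used. It assumes the data directory layout shown above; the function name `find_lock_files` is made up for illustration. Listing the files only confirms that they exist; it does not show whether another process still holds a lock on them.

```python
from pathlib import Path

# File names taken from the two paths listed in this issue.
LOCK_NAMES = {"node.lock", "write.lock"}


def find_lock_files(data_dir):
    """Return the lock files found anywhere under data_dir, sorted by path."""
    root = Path(data_dir)
    return sorted(
        p for p in root.rglob("*.lock")
        if p.name in LOCK_NAMES and p.is_file()
    )


if __name__ == "__main__":
    # Path assumed from the error message in this issue.
    for path in find_lock_files("/usr/share/opensearch/data"):
        print(path)
```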

What is the recommended way of dealing with this situation? Deleting the lock files lets the container start, but unassigned shards remain, leaving the cluster in a yellow state. The only solution that has worked so far is deleting the PVC and the pod, letting the StatefulSet recreate both, and waiting for the missing index replicas to be recreated.
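
To see which shards stay unassigned after the restart, the JSON output of `GET _cat/shards?format=json` can be filtered. The following is a minimal sketch, not an official procedure: the function name `unassigned_shards` and the sample payload are made up for illustration, and the field names (`index`, `shard`, `prirep`, `state`) are the ones the cat shards API normally returns.

```python
import json


def unassigned_shards(cat_shards_json):
    """Return (index, shard, prirep) for every row whose state is UNASSIGNED.

    cat_shards_json is the response body of GET _cat/shards?format=json.
    """
    rows = json.loads(cat_shards_json)
    return [
        (row["index"], row["shard"], row["prirep"])
        for row in rows
        if row.get("state") == "UNASSIGNED"
    ]


if __name__ == "__main__":
    # Hypothetical response body, not taken from the affected cluster.
    sample = json.dumps([
        {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED"},
        {"index": "logs", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
    ])
    for index, shard, prirep in unassigned_shards(sample):
        print(index, shard, prirep)
```

For any shard this reports, `GET _cluster/allocation/explain` is the API that describes why the shard is not being allocated, which may point to something less drastic than deleting the disk.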

Shouldn't there be a way, or some documentation, on how to proceed in such a case? Ideally with a less aggressive strategy than deleting the disk?

I also tried asking the community for help, but no luck so far: https://forum.opensearch.org/t/unassigned-shards-after-killed-containers-blackout/16812

@cpockrandt added the bug and untriaged labels Dec 4, 2023
@dblock transferred this issue from opensearch-project/OpenSearch Dec 7, 2023
@prudhvigodithi
Member

[Untriage]
Hey @cpockrandt, thanks for reporting this bug. May I know if you are using NFS for the PVC?
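
One way to answer this from inside the container is to look up the filesystem type of the data path in `/proc/mounts`. The sketch below assumes a Linux container and the data path from the error message; the function name `filesystem_type` is made up for illustration. An NFS-backed volume would typically show up as `nfs` or `nfs4`.

```python
def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount that contains path.

    mounts_text is text in /proc/mounts format: device, mount point,
    filesystem type, options, and two numeric fields per line.
    The longest matching mount point wins.
    """
    best = None
    target = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if target.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fstype)
    return best[1] if best else None


if __name__ == "__main__":
    # Path assumed from the error message in this issue.
    with open("/proc/mounts") as f:
        print(filesystem_type("/usr/share/opensearch/data", f.read()))
```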

@prudhvigodithi removed the untriaged label Dec 19, 2023
@prudhvigodithi transferred this issue from opensearch-project/opensearch-devops Dec 19, 2023
@github-actions bot added the untriaged label Dec 19, 2023
@prudhvigodithi removed the untriaged label Dec 19, 2023
@cpockrandt
Contributor Author

cpockrandt commented Dec 20, 2023

[Untriage] Hey @cpockrandt, thanks for reporting this bug. May I know if you are using NFS for the PVC?

Hey @prudhvigodithi, I used PVC.

Projects
Status: 📦 Backlog