OpenSearch 2.11 is running in Kubernetes with 3 pods, a pretty vanilla installation with the latest Helm Chart 2.17.0. The pods terminate ungracefully (e.g., due to a cluster-wide blackout, or a bug in node eviction that does not respect the graceful termination period).
After the pods come back up, one of the pods starts outputting errors:
{"type": "server", "timestamp": "2023-11-24T14:52:02,526Z", "level": "ERROR", "component": "o.o.b.OpenSearchUncaughtExceptionHandler", "cluster.name": "os", "node.name": "os-mngr-1", "message": "uncaught exception in thread [main]",
"stacktrace": ["org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/opensearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",
...
It seems there are lock files on the PVC that prevent the container from starting:
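For reference, the lock in question is typically a `node.lock` file under the data path reported in the stack trace. A minimal sketch for locating it, assuming the pod name from the log above and the default data path (adjust both for your deployment):

```shell
# Look inside the affected pod (pod name taken from the log above)
kubectl exec -it os-mngr-1 -- \
  ls -l /usr/share/opensearch/data/nodes/0/node.lock

# Removing the lock lets the node start again, but the shards may
# still come back unassigned afterwards:
kubectl exec -it os-mngr-1 -- \
  rm /usr/share/opensearch/data/nodes/0/node.lock
```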
What is the recommended way of dealing with this situation? Deleting the lock files lets the container start, but unassigned shards remain, leaving the cluster in a yellow state. The only solution that has worked so far: deleting the PVC and the pod, letting the StatefulSet recreate both, and letting the missing index replicas be rebuilt.
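For diagnosing the yellow state, the standard OpenSearch allocation APIs can show why the replicas stay unassigned. A sketch, assuming a port-forward to the cluster service (the service name is an assumption; check `kubectl get svc` for your release):

```shell
# Forward the HTTP port locally (service name assumed; adjust to your release)
kubectl port-forward svc/os-mngr 9200:9200 &

# List shards that are not allocated, with the reason
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' \
  | grep UNASSIGNED

# Ask the cluster to explain a failing allocation decision
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'

# If the data is still intact on disk, retrying failed allocations can
# sometimes recover the shards without deleting anything
curl -s -X POST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'
```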
Shouldn't there be a way, or at least documentation, describing how to proceed in such a case? Ideally with a less aggressive strategy than deleting the disk.
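The workaround described above can be sketched as follows. The PVC name is an assumption based on typical StatefulSet naming (`<volumeClaimTemplate>-<pod-name>`); verify the actual names first:

```shell
# WARNING: this discards the node's local data; only the replica copies
# held on the other nodes survive. Verify the names first:
kubectl get pvc,pods

# PVC name is an assumption (volumeClaimTemplate + pod name); pod name
# is taken from the log above:
kubectl delete pvc data-os-mngr-1
kubectl delete pod os-mngr-1

# The StatefulSet recreates the pod and PVC; the missing replicas are
# then rebuilt from the surviving copies. Watch the cluster return to
# green (assumes a port-forward to the cluster on localhost:9200):
curl -s 'localhost:9200/_cluster/health?pretty'
```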
I also tried asking the community for help, but have had no luck so far: https://forum.opensearch.org/t/unassigned-shards-after-killed-containers-blackout/16812