[BUG] Container doesn't start up after ungraceful termination #511

cpockrandt · 2023-12-04T19:26:02Z

Describe the bug

OpenSearch 2.11 is running in Kubernetes with 3 pods, a fairly vanilla installation using the latest Helm chart, 2.17.0. The pods terminate ungracefully (e.g., through a blackout of the cluster or a bug in the node eviction that does not respect the graceful termination period).

After the pods come back up, one of the pods starts outputting errors:

{"type": "server", "timestamp": "2023-11-24T14:52:02,526Z", "level": "ERROR", "component": "o.o.b.OpenSearchUncaughtExceptionHandler", "cluster.name": "os", "node.name": "os-mngr-1", "message": "uncaught exception in thread [main]", 
"stacktrace": ["org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/opensearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",
...

It seems there are lock files on the PVC that prevent the container from starting:

/usr/share/opensearch/data/nodes/0/node.lock
/usr/share/opensearch/data/nodes/0/_state/write.lock
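
For reference, a minimal sketch of how such leftover lock files could be listed, for example from a debug container that has the PVC mounted. It assumes the data path from the error message above; the function name `find_lock_files` is made up for illustration and is not part of OpenSearch or the Helm chart.

```python
from pathlib import Path


def find_lock_files(data_dir: str) -> list[str]:
    """Return all *.lock files under data_dir, sorted, as POSIX-style paths."""
    root = Path(data_dir)
    if not root.is_dir():
        # Nothing to report if the volume is not mounted at this path.
        return []
    return sorted(p.as_posix() for p in root.rglob("*.lock") if p.is_file())


if __name__ == "__main__":
    # Data path taken from the error message; adjust if the volume is mounted elsewhere.
    for path in find_lock_files("/usr/share/opensearch/data"):
        print(path)
```

This only reports the files. Whether removing them is safe is exactly the open question of this issue.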

What is the recommended way of dealing with this situation? Deleting the lock files lets the container start, but unassigned shards remain, leaving the cluster in a yellow state. The only solution that has worked so far is deleting the PVC and the pod, so that the StatefulSet recreates both and the missing replicas of the indices get recreated.
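
One less aggressive step that could be tried before deleting the PVC is to ask the cluster why the shards stay unassigned and to have it retry failed allocations. The following is a minimal sketch, not a confirmed fix: it only builds requests for the `_cluster/allocation/explain` and `_cluster/reroute?retry_failed=true` REST endpoints. The base URL `https://localhost:9200` and the helper names are assumptions for illustration, and TLS and authentication are not handled.

```python
import json
import urllib.request
from typing import Optional


def build_request(base_url: str, method: str, path: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an HTTP request against the cluster REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(base_url.rstrip("/") + path, data=data, method=method)
    req.add_header("Content-Type", "application/json")
    return req


def explain_allocation(base_url: str) -> urllib.request.Request:
    # Without a request body, the API explains an arbitrary unassigned shard.
    return build_request(base_url, "GET", "/_cluster/allocation/explain")


def retry_failed_allocations(base_url: str) -> urllib.request.Request:
    # Asks the cluster to retry shards whose allocation has failed repeatedly.
    return build_request(base_url, "POST", "/_cluster/reroute?retry_failed=true")


if __name__ == "__main__":
    base = "https://localhost:9200"  # assumption: adjust host, port, and credentials
    for req in (explain_allocation(base), retry_failed_allocations(base)):
        # Print what would be sent; pass req to urllib.request.urlopen to send it.
        print(req.get_method(), req.full_url)
```

The explanation returned by the first request should indicate whether a retry has any chance of succeeding or whether the shard copies on the affected node are considered stale.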

Shouldn't there be a way, or some documentation, on how to proceed in such a case? Ideally with a less aggressive strategy than deleting the disk?

I also tried asking the community for help, but no luck so far: https://forum.opensearch.org/t/unassigned-shards-after-killed-containers-blackout/16812


prudhvigodithi · 2023-12-19T21:30:16Z

[Untriage]
Hey @cpockrandt, thanks for reporting this bug. May I know if you are using NFS for the PVC?

cpockrandt · 2023-12-20T18:16:53Z

[Untriage] Hey @cpockrandt thanks for reporting this bug, may I know if you are using a NFS for the PVC ?

Hey @prudhvigodithi, I used PVC.

cpockrandt added the bug and untriaged labels Dec 4, 2023

dblock transferred this issue from opensearch-project/OpenSearch Dec 7, 2023

prudhvigodithi removed the untriaged label Dec 19, 2023

prudhvigodithi transferred this issue from opensearch-project/opensearch-devops Dec 19, 2023

github-actions bot added the untriaged label Dec 19, 2023

prudhvigodithi removed the untriaged label Dec 19, 2023

peterzhuamazon added this to Engineering Effectiveness Board Jul 11, 2024

github-project-automation bot moved this to 🆕 New in Engineering Effectiveness Board Jul 11, 2024

getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Engineering Effectiveness Board Jul 18, 2024

peterzhuamazon moved this to 📦 Backlog in Engineering Effectiveness Board Dec 4, 2024
