Replies: 3 comments
-
Yes, this is a known issue. It shows up especially when backup software takes backups of CephFS PVCs too frequently. The external provisioner has a tracker for PVCs being deleted when the provisioner pod restarts: kubernetes-csi/external-provisioner#486. We've also asked Ceph to reject CephFS clone creation requests preemptively to avoid this scenario.
-
Thank you very much for summarizing this here. It definitely helps us understand that it's a known issue and gives us the full picture. We'll try to find a solution that uses the new RO snapshot clones, so that this doesn't happen in the first place.
-
There seems to be progress on the Ceph tracker: https://tracker.ceph.com/issues/59714. Converting this issue into a discussion, since issue #3996 already exists to track and fix this.
-
Background
We are still struggling with Kasten.io exports: they use the old RW snapshot clone method, which times out on volumes of any significant size.
The trouble is not just that the backups fail, so that we had to put a workaround in place (that part is for Kasten to solve). Whenever the issue happens we also get a stray folder on CephFS, which eventually eats up our disk space and is difficult and a bit scary to clean up when you have no clear reference to where it came from.
We blame it on the issue below, and I'd love to hear if anyone else has experienced this. If there's interest in working on this and you need me to reproduce it manually, I'm happy to share details.
Issue
When cloning a CephFS snapshot, if the PVC is deleted while the cloning process is still ongoing, a stray clone remains in the CephFS filesystem.
Affected Versions
Tested on ceph-csi 3.8
Steps to Reproduce
Initiate a clone from a snapshot of significant size in traditional RW mode.
Before the cloning process completes and the volume becomes available, delete the PVC.
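For anyone who wants to try this, the first step can be sketched as a PVC that uses a VolumeSnapshot as its data source. This is only a minimal sketch: the PVC name, snapshot name, storage class and size are hypothetical placeholders, and `ReadWriteMany` is assumed to be what selects the traditional RW clone path.

```python
import json


def clone_pvc_manifest(pvc_name, snapshot_name, storage_class, size):
    """Build a PVC manifest that is restored (cloned) from a VolumeSnapshot.

    All argument values used below are hypothetical placeholders.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": pvc_name},
        "spec": {
            "storageClassName": storage_class,
            # Assumption: a writable access mode corresponds to the
            # "traditional RW mode" mentioned in the steps.
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": size}},
            "dataSource": {
                "apiGroup": "snapshot.storage.k8s.io",
                "kind": "VolumeSnapshot",
                "name": snapshot_name,
            },
        },
    }


manifest = clone_pvc_manifest("clone-test", "big-volume-snap", "cephfs-sc", "500Gi")
print(json.dumps(manifest, indent=2))
```

Piping the output to `kubectl apply -f -` and then running `kubectl delete pvc clone-test` while the claim is still Pending should correspond to the two steps above.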
Expected Behavior
The cloning process should either be interrupted and the clone should be removed from CephFS, or there should be a reference retained in Kubernetes.
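The first option can be pictured as a cleanup routine that runs when the PVC is deleted mid-clone: cancel the clone if it is still running, then remove it. The sketch below only builds the `ceph` commands rather than running them. The filesystem name, the clone name and the `csi` subvolume group are assumptions, and this is not a description of how ceph-csi handles deletion.

```python
def cleanup_commands(fs_name, clone_name, clone_state, group="csi"):
    """Return the ceph CLI calls that would cancel and remove a clone.

    clone_state is the state reported by `ceph fs clone status`,
    for example "pending", "in-progress", "complete", "failed" or "canceled".
    """
    commands = []
    if clone_state in ("pending", "in-progress"):
        # Stop the copy before trying to remove the target subvolume.
        commands.append(
            ["ceph", "fs", "clone", "cancel", fs_name, clone_name, "--group_name", group]
        )
    remove = ["ceph", "fs", "subvolume", "rm", fs_name, clone_name, "--group_name", group]
    if clone_state != "complete":
        # A canceled clone has to be removed with --force;
        # the same is assumed here for failed clones.
        remove.append("--force")
    commands.append(remove)
    return commands


# Hypothetical names, for illustration only.
for command in cleanup_commands("myfs", "csi-vol-example", "in-progress"):
    print(" ".join(command))
```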
Actual Behavior
The cloning process continues uninterrupted, resulting in a new folder appearing on CephFS. However, there is no reference to this folder in Kubernetes, neither as a PV nor a PVC.
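To get the missing reference back, one approach is to compare the subvolumes that exist on CephFS with the ones Kubernetes still knows about. The sketch below assumes the inputs are the parsed JSON from `ceph fs subvolume ls <fs> --group_name <group>` and `kubectl get pv -o json`, and that ceph-csi records the backing subvolume in the PV's `volumeAttributes` under `subvolumeName`. The sample data is hypothetical.

```python
def find_stray_subvolumes(subvolume_list, pv_list):
    """Return the subvolume names that no PersistentVolume refers to.

    subvolume_list: parsed JSON from `ceph fs subvolume ls`, a list of {"name": ...}
    pv_list: parsed JSON from `kubectl get pv -o json`
    """
    referenced = set()
    for pv in pv_list.get("items", []):
        csi = pv.get("spec", {}).get("csi") or {}
        attrs = csi.get("volumeAttributes") or {}
        # Assumption: the attribute key is "subvolumeName".
        name = attrs.get("subvolumeName")
        if name:
            referenced.add(name)
    return sorted(sv["name"] for sv in subvolume_list if sv["name"] not in referenced)


# Hypothetical sample data, for illustration only.
subvolumes = [{"name": "csi-vol-aaa"}, {"name": "csi-vol-bbb"}]
pvs = {
    "items": [
        {"spec": {"csi": {"volumeAttributes": {"subvolumeName": "csi-vol-aaa"}}}}
    ]
}
print(find_stray_subvolumes(subvolumes, pvs))
```

Note that a clone still in progress for a live PVC has no PV yet, so it would show up in this list too. Anything the function returns is a candidate to check by hand, not something that is safe to delete automatically.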