[gfd] Add option to disable automatic cleanup features file on gpu-feature-discovery exit #796
Labels
lifecycle/stale
Issue description
We use the node-feature-discovery and gpu-feature-discovery features to monitor GPU issues, including cases when the number of available GPUs on a node unexpectedly decreases: `Target Number == nvidia.com/gpu.count == Node Allocatable`.
We have noticed that sometimes, after restarting gpu-feature-discovery, all the features (labels `nvidia.com/*`) exported by gpu-feature-discovery disappear from the node for a period roughly equal to the nfd-worker `sleepInterval` (in our case, 1 minute). This causes false positives in our monitoring system.
We found that this occurs because gpu-feature-discovery deletes the `features.d/gfd` file before terminating if it is not running in one-shot mode (done using the `removeOutputFile` function).
This behavior is very inconvenient (and undesirable) for us, especially when updating the gpu-feature-discovery version in the cluster.
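To make the false positive concrete, here is a minimal sketch of the kind of consistency check described above. It is not our actual monitoring code: the function name `gpu_count_consistent`, the way labels are passed in as a dict, and the integer `allocatable` argument are all assumptions made for illustration.

```python
from typing import Mapping


def gpu_count_consistent(target: int, node_labels: Mapping[str, str], allocatable: int) -> bool:
    """Return True when Target Number == nvidia.com/gpu.count == Node Allocatable.

    Hypothetical helper: `node_labels` stands in for the node's label map and
    `allocatable` for the node's allocatable GPU count, already parsed to int.
    """
    raw = node_labels.get("nvidia.com/gpu.count")
    if raw is None:
        # The label is gone, e.g. right after gpu-feature-discovery removed
        # features.d/gfd. A naive check reports this as a GPU count mismatch.
        return False
    return target == int(raw) == allocatable


if __name__ == "__main__":
    # Healthy node: all three values agree.
    print(gpu_count_consistent(8, {"nvidia.com/gpu.count": "8"}, 8))
    # Same node during the window after a gfd restart: labels are missing.
    print(gpu_count_consistent(8, {}, 8))
```

In the second call nothing is wrong with the GPUs themselves; the check fails only because the label was removed, which is the false positive we see.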
Feature request
I found that this behavior was added in commit NVIDIA/gpu-feature-discovery@bc91c4a. However, I did not find an associated issue justifying the need for it.
Could you please consider:
1. Adding an option to disable this cleanup on exit (for example, `--no-cleanup-on-exit`).
An argument for 2. could be that node-feature-discovery does not do this. Instead, it uses a prune-job.
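As a rough illustration of the requested behavior, the sketch below models the shutdown decision in Python. It is not the gpu-feature-discovery implementation (which is written in Go): `should_remove_output_file` and `shutdown` are hypothetical names, and `no_cleanup_on_exit` mirrors the proposed flag, which does not exist today.

```python
import os
import tempfile


def should_remove_output_file(oneshot: bool, no_cleanup_on_exit: bool) -> bool:
    """Decide whether the features file is deleted at shutdown.

    Models the behavior described in this issue: the file is removed unless
    running in one-shot mode. The proposed (hypothetical) option adds a
    second way to keep the file.
    """
    if oneshot:
        return False
    return not no_cleanup_on_exit


def shutdown(output_path: str, oneshot: bool, no_cleanup_on_exit: bool) -> None:
    """Remove the features file on exit only when the decision above says so."""
    if should_remove_output_file(oneshot, no_cleanup_on_exit) and os.path.exists(output_path):
        os.remove(output_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as features_dir:
        path = os.path.join(features_dir, "gfd")
        with open(path, "w") as f:
            f.write("nvidia.com/gpu.count=8\n")
        # With the proposed option set, the file survives a restart, so
        # nfd-worker keeps publishing the labels in the meantime.
        shutdown(path, oneshot=False, no_cleanup_on_exit=True)
        print(os.path.exists(path))
```

With the option left at its default, the decision is unchanged from today, so existing deployments would not be affected.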