[gfd] Add option to disable automatic cleanup features file on gpu-feature-discovery exit #796

Closed
belo4ya opened this issue Jul 1, 2024 · 4 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

Contributor

belo4ya commented Jul 1, 2024

Issue description

We use the labels exported by node-feature-discovery and gpu-feature-discovery to monitor GPU issues, including cases where the number of available GPUs on a node unexpectedly decreases. The invariant we check is: Target Number == nvidia.com/gpu.count == Node Allocatable.
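
To make the check concrete, here is a minimal Go sketch of that invariant. It is an illustration only, not our actual monitoring code: the helper name checkGPUCount and its parameters are hypothetical, and only the label name nvidia.com/gpu.count comes from the cluster. Note that a missing label is reported as a failure, which is exactly how a temporary loss of the nvidia.com/* labels turns into a false positive.

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuCountLabel is the node label exported by gpu-feature-discovery.
const gpuCountLabel = "nvidia.com/gpu.count"

// checkGPUCount is a hypothetical helper. It returns nil only when
// target == nvidia.com/gpu.count == allocatable.
func checkGPUCount(target int, labels map[string]string, allocatable int) error {
	raw, ok := labels[gpuCountLabel]
	if !ok {
		// The label is absent, e.g. while the features file is missing.
		return fmt.Errorf("label %s is missing", gpuCountLabel)
	}
	count, err := strconv.Atoi(raw)
	if err != nil {
		return fmt.Errorf("label %s has non-numeric value %q", gpuCountLabel, raw)
	}
	if count != target || allocatable != target {
		return fmt.Errorf("gpu count mismatch: target=%d label=%d allocatable=%d",
			target, count, allocatable)
	}
	return nil
}

func main() {
	// A healthy node: all three numbers agree.
	healthy := map[string]string{gpuCountLabel: "8"}
	fmt.Println(checkGPUCount(8, healthy, 8))

	// The same node while its labels are gone: the check fails.
	fmt.Println(checkGPUCount(8, map[string]string{}, 8))
}
```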

We have noticed that sometimes after restarting gpu-feature-discovery, all the features (labels nvidia.com/*) exported by gpu-feature-discovery disappear from the node for a period roughly equal to the nfd-worker sleepInterval (in our case, 1 minute). This causes false positives in our monitoring system.

We found that this occurs because gpu-feature-discovery deletes the features.d/gfd file before terminating if it is not running in one-shot mode (done using the removeOutputFile function).

This behavior is very inconvenient (and undesirable) for us, especially when updating the gpu-feature-discovery version in the cluster.

Feature request

This behavior was added in commit NVIDIA/gpu-feature-discovery@bc91c4a, but I could not find an associated issue explaining why it is needed.

Could you please consider:

  1. Adding a flag (and/or environment variable), e.g. --no-cleanup-on-exit, that disables the automatic cleanup when gpu-feature-discovery terminates.
  2. Or removing the automatic cleanup on gpu-feature-discovery shutdown altogether.

An argument for 2. could be that node-feature-discovery does not do this. Instead, it uses a prune-job.

Contributor Author

belo4ya commented Jul 10, 2024

@elezar, @klueska, @ArangoGutierrez, please take a look at this

Member

elezar commented Aug 12, 2024

Thanks @belo4ya, I have created #899 to add this option and we can continue this discussion there.

@ArangoGutierrez one thing that I noted is that we don't do any cleanup when the NodeFeatureAPI is used. How are labels removed in this case?


This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 11, 2024

This issue was automatically closed due to inactivity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 11, 2024