-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete non-LRU cache in SPIRE Agent #5383
Merged
amartinezfayo
merged 22 commits into
spiffe:main
from
amoore877:delete_non_lru_for_reals
Sep 27, 2024
Merged
Changes from all commits
Commits
Show all changes
22 commits
Select commit
Hold shift + click to select a range
c57535d
compiles wheeeee
amoore877 4ef00d8
tests compile
amoore877 b14ca20
local tests passing
amoore877 4cdf1a2
fix: lint (linux), lint (windows)
amoore877 f838ce8
fix integration test fetch-x509-svids
amoore877 f7ac93a
one more removeable config ref
amoore877 7a077c2
fix: lint (linux), lint (windows)
amoore877 b148c8f
helpful test comments
amoore877 81a6821
Merge branch 'main' into delete_non_lru_for_reals
amoore877 a6653fa
Merge branch 'main' into delete_non_lru_for_reals
amoore877 02fd57b
Merge branch 'main' into delete_non_lru_for_reals
amoore877 d7330e6
Merge branch 'main' into delete_non_lru_for_reals
amoore877 d11a640
Merge branch 'main' into delete_non_lru_for_reals
amoore877 37ec2dd
struct doc
amoore877 5cbf474
please stop the screaming
amoore877 0dce619
struct doc 2
amoore877 42e33b3
simplify struct
amoore877 7e6e85a
comment fix
amoore877 a1c5b9e
Merge branch 'main' into delete_non_lru_for_reals
amoore877 686804f
Merge branch 'main' into delete_non_lru_for_reals
amoore877 dd27054
Merge branch 'main' into delete_non_lru_for_reals
amoore877 2c0fb70
Merge branch 'main' into delete_non_lru_for_reals
amartinezfayo File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if we shouldn't keep this setting but promote it out of experimental. It has some implications for existing deployments, namely that the initial fetch of a SVID may become slower now if there are more active workloads than 1000. It can become noticeable for the case where you run lots of short lived workloads.
Or maybe keep it as experimental in case someone complains and remove after it a few releases if nobody complains?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is actually what is being executed. 1.10 already has a merged PR #5150 for deprecating this feature (via warnings in logs). This PR is planned for 1.11 to finally remove it entirely.
I'm sure if there are raised complaints wrt to the deprecation and removal, SPIRE maintainers and the community would be happy to discuss. You can raise this in SPIFFE slack or in contributor sync
fwiw, and this of course would vary significantly based on operator environment, my organization's own load testing showed on the version this feature was introduced at most 1-3s of fetchx509svid latency experienced by consumers when the cache size was set to 1k and the agent was being cycled through 26k registrations.
I'd be curious to know for your own environment how registrations are being managed for short-lived- are they always statically registered? if not, are they aggressively or lazily culled? from a security perspective it's best to have an agent only be assigned precisely the workload identities that are expected to be running with it in that exact moment, though of course there are windows where there would be excess dependent on strategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just something I wanted to raise as a potential issue as I've ran into a similar issue with other pieces of software. 1-3 seconds for something that is supposed to be started lots of times, for example batch jobs, can add up to a lot. With the probably more common use case of longer lived services I don't think a couple of seconds of start up delay is going to matter that much.
We don't mandate short-lived vs long-lived, but registration entries would usually be towards the longer lived end so I don't expect to have any issues. There are some plans for using it in a place with shorter lived workloads, so I'll see at that point if I'll actually have any issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LRU SVID cache has been enabled by default since v1.9.0, using the cache size of 1000. Since we haven't heard from users the need of tuning it with a different value, we went ahead with the plan to deprecating
x509_svid_cache_max_size
in v1.10.0 and remove it in v1.11.0.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if that catches some of the more subtle way this changes deployments. For a while people have been used to thinking that the agent caches X509-SVIDs for all workloads. This means that if spire-server becomes unavailable for some time, you still have up to X509-SVID TTL/2 to fix it since the agent can serve svids from the cache.
This now changes it so that the agent caches up to 1000 X509-SVIDs, but no more until the workloads actually start and start requesting SVIDs. This means that the time to fix things can be shortened drastically. 1000 is an arbitrary number, it would still be nice to have the ability to specify how many workloads you want to have in the cache by default.
I don't want to hard block this, but I think it's worth having the maintainers consider this a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sorindumitru totally understand :) I definitely appreciate wanting to protect global operator experience and the benefit of open source is being able to have these sorts of conversations and concerns raised.
iiuc , there is generally a desire from maintainers to limit just how many levers and knobs SPIRE has. It can already be difficult for some groups to adopt it, and each control is a potential point of confusion or breakage as well as a new dimension and permutation of deployment that maintainers need to support and receive issue/feature requests about. so that is a key motivator to removing this configuration. there is also definitely a split in operator groups- one that takes the binaries / source code as-is, and one that internally forks it to make adjustments (such as on hard-coded values as the most trivial example).
fwd: @amartinezfayo - if there's more to add not covered here or in #5383 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @sorindumitru and @amoore877 for all the feedback and thoughts on this.
I think that the last comment from @amoore877 reflects pretty well what the maintainer's group analyzes when a new setting is b being introduced.
We have discussed this again in our last maintainer's call, and after reevaluating this, we feel that it would be prudent to keep the
x509_svid_cache_max_size
setting. The concerns pointed by @sorindumitru are some valid concerns. At this point, I personally think that being in a situation where you would need to adjust the cache size seems to be a lot more problematic than the fact that there is a new setting that can be tweaked, mostly considering that there will be a default value and users will not need to be aware of this setting if they don't need to adjust it. Having a proper documentation explaining when you may need to use this setting (e.g. when the agent handles more than 1000 active workloads) will help.I think that we can promote it as a stable setting. @amoore877 Could you update the PR to reflect that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on that plan, would it not be better to close/abandon this PR and then another PR reverts #5150 which marked the setting as deprecated?
yet another PR (to reduce how much has to be reviewed) would then promote the setting out of experimental.
would we still be removing the non-LRU cache code?
would be good to update #4224 with the full goal state from maintainers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the
x509_svid_cache_max_size
setting will not belong to the experimental section anymore, I think that the deprecation notice was OK, because the setting will not be recognized there anymore. In terms of closing this PR, I was thinking that it could just be updated to move thex509_svid_cache_max_size
setting out of the experimental section, being now a stable configurable. But if you prefer to handle that in a separate PR, that would work also.Yes, no changes with that, we are removing the non-LRU cache code in v1.11.0.
I've updated #4224 accordingly.