Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete non-LRU cache in SPIRE Agent #5383

Merged
merged 22 commits into from
Sep 27, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 0 additions & 17 deletions cmd/spire-agent/cli/run/run.go
Original file line number Diff line number Diff line change
Expand Up @@ -120,10 +120,6 @@ type experimentalConfig struct {
UseSyncAuthorizedEntries bool `hcl:"use_sync_authorized_entries"`

Flags fflag.RawConfig `hcl:"feature_flags"`

UnusedKeyPositions map[string][]token.Pos `hcl:",unusedKeyPositions"`
X509SVIDCacheMaxSize int `hcl:"x509_svid_cache_max_size"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we shouldn't keep this setting but promote it out of experimental. It has some implications for existing deployments, namely that the initial fetch of a SVID may become slower now if there are more active workloads than 1000. It can become noticeable for the case where you run lots of short lived workloads.

Or maybe keep it as experimental in case someone complains and remove after it a few releases if nobody complains?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe keep it as experimental in case someone complains and remove after it a few releases if nobody complains?

so this is actually what is being executed. 1.10 already has a merged PR #5150 for deprecating this feature (via warnings in logs). This PR is planned for 1.11 to finally remove it entirely.

I'm sure if there are raised complaints wrt to the deprecation and removal, SPIRE maintainers and the community would be happy to discuss. You can raise this in SPIFFE slack or in contributor sync

It can become noticeable for the case where you run lots of short lived workloads.

fwiw, and this of course would vary significantly based on operator environment, my organization's own load testing showed on the version this feature was introduced at most 1-3s of fetchx509svid latency experienced by consumers when the cache size was set to 1k and the agent was being cycled through 26k registrations.

I'd be curious to know for your own environment how registrations are being managed for short-lived- are they always statically registered? if not, are they aggressively or lazily culled? from a security perspective it's best to have an agent only be assigned precisely the workload identities that are expected to be running with it in that exact moment, though of course there are windows where there would be excess dependent on strategy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just something I wanted to raise as a potential issue as I've ran into a similar issue with other pieces of software. 1-3 seconds for something that is supposed to be started lots of times, for example batch jobs, can add up to a lot. With the probably more common use case of longer lived services I don't think a couple of seconds of start up delay is going to matter that much.

We don't mandate short-lived vs long-lived, but registration entries would usually be towards the longer lived end so I don't expect to have any issues. There are some plans for using it in a place with shorter lived workloads, so I'll see at that point if I'll actually have any issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we shouldn't keep this setting but promote it out of experimental

The LRU SVID cache has been enabled by default since v1.9.0, using the cache size of 1000. Since we haven't heard from users the need of tuning it with a different value, we went ahead with the plan to deprecating x509_svid_cache_max_size in v1.10.0 and remove it in v1.11.0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that catches some of the more subtle way this changes deployments. For a while people have been used to thinking that the agent caches X509-SVIDs for all workloads. This means that if spire-server becomes unavailable for some time, you still have up to X509-SVID TTL/2 to fix it since the agent can serve svids from the cache.

This now changes it so that the agent caches up to 1000 X509-SVIDs, but no more until the workloads actually start and start requesting SVIDs. This means that the time to fix things can be shortened drastically. 1000 is an arbitrary number, it would still be nice to have the ability to specify how many workloads you want to have in the cache by default.

I don't want to hard block this, but I think it's worth having the maintainers consider this a bit more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sorindumitru totally understand :) I definitely appreciate wanting to protect global operator experience and the benefit of open source is being able to have these sorts of conversations and concerns raised.

1000 is an arbitrary number, it would still be nice to have the ability to specify how many workloads you want to have in the cache by default.

iiuc , there is generally a desire from maintainers to limit just how many levers and knobs SPIRE has. It can already be difficult for some groups to adopt it, and each control is a potential point of confusion or breakage as well as a new dimension and permutation of deployment that maintainers need to support and receive issue/feature requests about. so that is a key motivator to removing this configuration. there is also definitely a split in operator groups- one that takes the binaries / source code as-is, and one that internally forks it to make adjustments (such as on hard-coded values as the most trivial example).

fwd: @amartinezfayo - if there's more to add not covered here or in #5383 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sorindumitru and @amoore877 for all the feedback and thoughts on this.
I think that the last comment from @amoore877 reflects pretty well what the maintainer's group analyzes when a new setting is b being introduced.
We have discussed this again in our last maintainer's call, and after reevaluating this, we feel that it would be prudent to keep the x509_svid_cache_max_size setting. The concerns pointed by @sorindumitru are some valid concerns. At this point, I personally think that being in a situation where you would need to adjust the cache size seems to be a lot more problematic than the fact that there is a new setting that can be tweaked, mostly considering that there will be a default value and users will not need to be aware of this setting if they don't need to adjust it. Having a proper documentation explaining when you may need to use this setting (e.g. when the agent handles more than 1000 active workloads) will help.
I think that we can promote it as a stable setting. @amoore877 Could you update the PR to reflect that?

Copy link
Member Author

@amoore877 amoore877 Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on that plan, would it not be better to close/abandon this PR and then another PR reverts #5150 which marked the setting as deprecated?

yet another PR (to reduce how much has to be reviewed) would then promote the setting out of experimental.

would we still be removing the non-LRU cache code?

would be good to update #4224 with the full goal state from maintainers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on that plan, would it not be better to close/abandon this PR and then another PR reverts #5150 which marked the setting as deprecated?

Since the x509_svid_cache_max_size setting will not belong to the experimental section anymore, I think that the deprecation notice was OK, because the setting will not be recognized there anymore. In terms of closing this PR, I was thinking that it could just be updated to move the x509_svid_cache_max_size setting out of the experimental section, being now a stable configurable. But if you prefer to handle that in a separate PR, that would work also.

would we still be removing the non-LRU cache code?

Yes, no changes with that, we are removing the non-LRU cache code in v1.11.0.

would be good to update #4224 with the full goal state from maintainers

I've updated #4224 accordingly.

DisableLRUCache bool `hcl:"disable_lru_cache"`
}

type Command struct {
Expand Down Expand Up @@ -498,19 +494,6 @@ func NewAgentConfig(c *Config, logOptions []log.Option, allowUnknownConfig bool)
ac.LogReopener = log.ReopenOnSignal(logger, reopenableFile)
}

if c.Agent.Experimental.X509SVIDCacheMaxSize < 0 {
return nil, errors.New("x509_svid_cache_max_size should not be negative")
}
if c.Agent.Experimental.X509SVIDCacheMaxSize > 0 || c.Agent.Experimental.DisableLRUCache {
logger.Warn("The `x509_svid_cache_max_size` and `disable_lru_cache` configurations are deprecated. They will be removed in a future release.")
}
ac.X509SVIDCacheMaxSize = c.Agent.Experimental.X509SVIDCacheMaxSize

if c.Agent.Experimental.DisableLRUCache && ac.X509SVIDCacheMaxSize != 0 {
return nil, errors.New("x509_svid_cache_max_size should not be set when disable_lru_cache is set")
}
ac.DisableLRUCache = c.Agent.Experimental.DisableLRUCache

td, err := common_cli.ParseTrustDomain(c.Agent.TrustDomain, logger)
if err != nil {
return nil, err
Expand Down
100 changes: 0 additions & 100 deletions cmd/spire-agent/cli/run/run_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -897,106 +897,6 @@ func TestNewAgentConfig(t *testing.T) {
require.Nil(t, c)
},
},
{
msg: "x509_svid_cache_max_size is set",
input: func(c *Config) {
c.Agent.Experimental.X509SVIDCacheMaxSize = 100
},
logOptions: func(t *testing.T) []log.Option {
return []log.Option{
func(logger *log.Logger) error {
logger.SetOutput(io.Discard)
hook := test.NewLocal(logger.Logger)
t.Cleanup(func() {
spiretest.AssertLogsContainEntries(t, hook.AllEntries(), []spiretest.LogEntry{
{
Level: logrus.WarnLevel,
Message: "The `x509_svid_cache_max_size` and `disable_lru_cache` " +
"configurations are deprecated. They will be removed in a future release.",
},
})
})
return nil
},
}
},
test: func(t *testing.T, c *agent.Config) {
require.EqualValues(t, 100, c.X509SVIDCacheMaxSize)
},
},
{
msg: "x509_svid_cache_max_size is not set",
input: func(c *Config) {
},
test: func(t *testing.T, c *agent.Config) {
require.EqualValues(t, 0, c.X509SVIDCacheMaxSize)
},
},
{
msg: "x509_svid_cache_max_size is zero",
input: func(c *Config) {
c.Agent.Experimental.X509SVIDCacheMaxSize = 0
},
test: func(t *testing.T, c *agent.Config) {
require.EqualValues(t, 0, c.X509SVIDCacheMaxSize)
},
},
{
msg: "x509_svid_cache_max_size is negative",
expectError: true,
input: func(c *Config) {
c.Agent.Experimental.X509SVIDCacheMaxSize = -10
},
test: func(t *testing.T, c *agent.Config) {
require.Nil(t, c)
},
},
{
msg: "disable_lru_cache is set",
input: func(c *Config) {
c.Agent.Experimental.DisableLRUCache = true
},
logOptions: func(t *testing.T) []log.Option {
return []log.Option{
func(logger *log.Logger) error {
logger.SetOutput(io.Discard)
hook := test.NewLocal(logger.Logger)
t.Cleanup(func() {
spiretest.AssertLogsContainEntries(t, hook.AllEntries(), []spiretest.LogEntry{
{
Level: logrus.WarnLevel,
Message: "The `x509_svid_cache_max_size` and `disable_lru_cache` " +
"configurations are deprecated. They will be removed in a future release.",
},
})
})
return nil
},
}
},
test: func(t *testing.T, c *agent.Config) {
require.True(t, c.DisableLRUCache)
},
},
{
msg: "both disable_lru_cache and x509_svid_cache_max_size are set",
expectError: true,
input: func(c *Config) {
c.Agent.Experimental.DisableLRUCache = true
c.Agent.Experimental.X509SVIDCacheMaxSize = 100
},
test: func(t *testing.T, c *agent.Config) {
require.Nil(t, c)
},
},
{
msg: "disable_lru_cache is not set",
input: func(c *Config) {
},
test: func(t *testing.T, c *agent.Config) {
require.False(t, c.DisableLRUCache)
},
},
{
msg: "allowed_foreign_jwt_claims provided",
input: func(c *Config) {
Expand Down
4 changes: 1 addition & 3 deletions doc/spire_agent.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,8 +75,6 @@ This may be useful for templating configuration files, for example across differ
|:------------------------------|--------------------------------------------------------------------------------------|-------------------------|
| `named_pipe_name` | Pipe name to bind the SPIRE Agent API named pipe (Windows only) | \spire-agent\public\api |
| `sync_interval` | Sync interval with SPIRE server with exponential backoff | 5 sec |
| `x509_svid_cache_max_size` | Soft limit of max number of SVIDs that would be stored in LRU cache (deprecated) | 1000 |
| `disable_lru_cache` | Reverts back to use the SPIRE Agent non-LRU cache for storing SVIDs (deprecated) | false |
| `use_sync_authorized_entries` | Use SyncAuthorizedEntries API for periodically synchronization of authorized entries | false |

### Initial trust bundle configuration
Expand Down Expand Up @@ -387,7 +385,7 @@ There are two ways the trusted delegate workload can request SVIDs for other wor
In this approach, the trusted delegate workload is entirely responsible for attesting the other workload and building the attested selectors.
When those selectors are presented to the SPIRE Agent, the SPIRE Agent will simply return SVIDs for any workload registration entries that match the provided selectors.
No other checks or attestations will be performed by the SPIRE Agent.

1. By obtaining a PID for the other workload, and providing that PID to the SPIRE Agent over the Delegated Identity API.
In this approach, the SPIRE Agent will do attestation for the provided PID, build the attested selectors, and return SVIDs for any workload registration entries that match the selectors the SPIRE Agent attested from that PID.
This differs from the previous approach in that the SPIRE Agent itself (not the trusted delegate) handles the attestation of the other workload.
Expand Down
2 changes: 0 additions & 2 deletions pkg/agent/agent.go
Original file line number Diff line number Diff line change
Expand Up @@ -275,8 +275,6 @@ func (a *Agent) newManager(ctx context.Context, sto storage.Storage, cat catalog
Storage: sto,
SyncInterval: a.c.SyncInterval,
UseSyncAuthorizedEntries: a.c.UseSyncAuthorizedEntries,
SVIDCacheMaxSize: a.c.X509SVIDCacheMaxSize,
DisableLRUCache: a.c.DisableLRUCache,
SVIDStoreCache: cache,
NodeAttestor: na,
RotationStrategy: rotationutil.NewRotationStrategy(a.c.AvailabilityTarget),
Expand Down
5 changes: 3 additions & 2 deletions pkg/agent/api/delegatedidentity/v1/service_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ import (
"testing"
"time"

"github.com/andres-erbsen/clock"
"github.com/sirupsen/logrus"
"github.com/sirupsen/logrus/hooks/test"
"github.com/spiffe/go-spiffe/v2/bundle/spiffebundle"
Expand Down Expand Up @@ -993,9 +994,9 @@ func (m *FakeManager) SubscribeToBundleChanges() *cache.BundleStream {
return myCache.BundleCache.SubscribeToBundleChanges()
}

func newTestCache() *cache.Cache {
func newTestCache() *cache.LRUCache {
log, _ := test.NewNullLogger()
return cache.New(log, trustDomain1, bundle1, telemetry.Blackhole{})
return cache.NewLRUCache(log, trustDomain1, bundle1, telemetry.Blackhole{}, clock.New())
}

func generateSubscribeToX509SVIDMetrics() []fakemetrics.MetricItem {
Expand Down
6 changes: 0 additions & 6 deletions pkg/agent/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -66,12 +66,6 @@ type Config struct {
// is used to sync entries from the server.
UseSyncAuthorizedEntries bool

// X509SVIDCacheMaxSize is a soft limit of max number of SVIDs that would be stored in cache
X509SVIDCacheMaxSize int

// DisableLRUCache disables the SPIRE Agent LRU cache used for storing SVIDs and fallback to original cache
DisableLRUCache bool

// Trust domain and associated CA bundle
TrustDomain spiffeid.TrustDomain
TrustBundle []*x509.Certificate
Expand Down
3 changes: 3 additions & 0 deletions pkg/agent/manager/cache/bundle_cache.go
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,12 @@ package cache

import (
"github.com/imkira/go-observer"
"github.com/spiffe/go-spiffe/v2/bundle/spiffebundle"
"github.com/spiffe/go-spiffe/v2/spiffeid"
)

type Bundle = spiffebundle.Bundle

type BundleCache struct {
trustDomain spiffeid.TrustDomain
bundles observer.Property
Expand Down
Loading
Loading