-
Notifications
You must be signed in to change notification settings - Fork 618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metrics to acs for eni provisioning workflow monitoring #4443
Conversation
ecs-agent/metrics/constants.go
Outdated
@@ -46,6 +46,11 @@ const ( | |||
ACSDisconnectTimeoutMetricName = agentAvailabilityNamespace + ".ACSDisconnectTimeout" | |||
TCSDisconnectTimeoutMetricName = agentAvailabilityNamespace + ".TCSDisconnectTimeout" | |||
|
|||
// ACS Session Metrics | |||
acsStartSessionNamespace = "ACSStartSession" | |||
ACSDiscoverPollEndpointDurationName = acsStartSessionNamespace + ".DiscoverPollEndpointDuration" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we calling these two duration metrics? What duration is being measured?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified offline, these metrics include a duration by default which will tell us the connection duration.
I don't think we're supposed to merge into the master branch, since master is used for version release. We should merge into dev instead |
@@ -262,8 +274,13 @@ func (s *session) startSessionOnce(ctx context.Context) error { | |||
}) | |||
return err | |||
} | |||
acsConnectionMetric.Done(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ACS metric won't fire if client.Connect
returns a non-nil error. Is that intentional? If so, this behavior is inconsistent with the DiscoverPollEndpoint metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because ACS infinitely retries its connection, is this metric going to be too noisy if we fire a failure metric on every failed connection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think we would want to fire metrics on failure - to be consistent with DiscoverPollEndpoint i think i will only fire the DiscoverPollEndpoint metric for successful calls
ecs-agent/acs/session/session.go
Outdated
s.firstDiscoverPollEndpointTime = time.Now() | ||
} | ||
|
||
discoverPollEndpointMetric := s.metricsFactory.New(metrics.ACSDiscoverPollEndpointDurationName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DiscoverPollEndpoint can be called by TACS session as well. Can we emit the metric in this single place - https://github.com/aws/amazon-ecs-agent/blob/master/ecs-agent/api/ecs/client/ecs_client.go#L679? which is used by both ACS and TACS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, will move the metric over there
Summary
Adding new metrics and variables in acs session
Implementation details
Added new metrics and variables in acs session for
StartSessionOnce
Testing
Tested by running all unit tests in acs
New tests cover the changes:
No new tests, but modified existing tests to account for new metrics (ie mocked the calls to create metrics)
Description for the changelog
Enhancement - add metrics to monitor latency in acs session
Additional Information
Does this PR include breaking model changes? If so, Have you added transformation functions?
No
Does this PR include the addition of new environment variables in the README?
No
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.