Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle specific exception codes on RCI call. #4457

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

BinSquare
Copy link

@BinSquare BinSquare commented Dec 16, 2024

Summary

  1. This adds a terminalError wrapper around 2 exception codes that are known to not be recoverable by retrying (this can help avoid retry storm). This data was verified from internal investigation.
  2. Separately, it also checks that throttlingError does return exitcode 1 i.e. it will be retried.

Implementation details

Changes are implemented on agent calling RCi.

Testing

Added 2 new tests, one to verify throttlingError returns exitcode 1 + one to verify ClientException will result in exit code 5 i.e. the terminal state for agent.

New tests cover the changes:

Description for the changelog

Terminally exit on unrecoverable RCI exception errors.

Additional Information

Does this PR include breaking model changes? If so, Have you added transformation functions?

No

Does this PR include the addition of new environment variables in the README?

No

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@BinSquare BinSquare requested a review from a team as a code owner December 16, 2024 18:05
@BinSquare BinSquare changed the base branch from master to dev December 20, 2024 22:35
@BinSquare BinSquare force-pushed the master branch 5 times, most recently from 49c48be to 7a5da69 Compare December 23, 2024 17:48
@BinSquare BinSquare changed the title Data model changes to consume ThrottlingExceptions. Handle specific exception codes on RCI call. Dec 23, 2024
@BinSquare BinSquare force-pushed the master branch 3 times, most recently from aa3e226 to 48fc82a Compare December 27, 2024 18:09
// which will allow agent to exit terminally.
func isTerminal(err error) bool {
// Check if the error is already wrapped as a terminalError
_, terminal := err.(terminalError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be a situation in which this gets wrapped again? Why not use a deeper equality check like errors.Is which inspects the error tree rather than its first layer?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should not be a case where it gets wrapped again - the terminal wrap should be the last layer as it's handled in this file. So in this case, both errors.Is and the current behaves the same.

Since existing isTransient is implemented similarly - hopefully this optimization shouldn't be a blocker?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a blocker, but in the case where we re-use these errors it may be a better long term change to wrap/unwrap these with the errors library.

if isTransient(err) {
if isTerminal(err) {
// On unrecoverable error codes, agent should terminally exit.
seelog.Warnf("Agent will terminally exit, unable to register container instance: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should use logger instead of seelog here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seelog.Warnf is chosen here to conform to existing style of using it for warning messages in this specific file i.e. there were 9 seelog.warn vs 2 logger.warn - so I opted for the more commonly used

There are 9 cases I used as reference found :

  1. seelog.Warnf("Error initializing metrics engine: %v", err)
  2. seelog.Warnf("Error creating telemetry session: %v", err)

Where as logger is also used but mostly for logger.critical and logger.info. There are only 2 cases of logger.Warn:

logger.Warn("Failed to get docker version from task engine", logger.Fields{

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the justification. I am no expert but it seems more idiomatic to conform to the in-house logger used in this library

logger.Critical("Instance registration attempt with client performing invalid action", logger.Fields{
field.Error: err,
})
return terminalError{err}
}
if _, ok := err.(apierrors.AttributeError); ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible this is also terminal error?

Instance registration attempt with invalid attribute(s)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling that specific case is beyond scope of this change.

However, invalidParameterException is handled above, I would not expect attributeError to fire because instance attributes are an instance parameter of the RCI call. Ecs agent should correct me here if I'm wrong

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. We can focus on the cases we care about now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants