Autoscaler auto discover fails with sagemaker hyperpod #7540

Open
Arthurhussey opened this issue Nov 28, 2024 · 1 comment
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@Arthurhussey

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 1.30

What k8s version are you using (kubectl version)?:

$ Client Version: v1.30.0
$ Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
$ Server Version: v1.30.6-eks-7f9249a

What environment is this in?:

AWS/EKS

What did you expect to happen?:

The autoscaler pod should scale the ASGs as required, even when a SageMaker HyperPod cluster is attached.

ASG scaling

What happened instead?:

No autoscaling takes place, and I can see these errors in the logs:

E1128 22:19:38.487185       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0059feb421f5b03ed
I1128 22:19:48.487306       1 static_autoscaler.go:306] Starting main loop
E1128 22:19:48.488169       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-00f55e8b5be774bd7
I1128 22:19:58.489198       1 static_autoscaler.go:306] Starting main loop
E1128 22:19:58.490443       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0a42a231e263cac0c
I1128 22:20:08.491624       1 static_autoscaler.go:306] Starting main loop
E1128 22:20:08.492629       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0059feb421f5b03ed
I1128 22:20:18.493453       1 static_autoscaler.go:306] Starting main loop
E1128 22:20:18.494944       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0cfbeeb3654698d80
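
For illustration, here is a minimal Go sketch of why these provider IDs are rejected. It assumes the check is a strict two-segment match on `aws:///<zone>/<name>`, as the error message suggests. The regular expression and the `parseProviderID` helper are hypothetical and are not copied from the cluster-autoscaler source. The first ID in `main` is a made-up EC2-style ID. The second is taken from the logs above.

```go
package main

import (
	"fmt"
	"regexp"
)

// providerIDPattern is an ASSUMED approximation of the aws:///<zone>/<name>
// format named in the error message. It is not the autoscaler's real regex.
var providerIDPattern = regexp.MustCompile(`^aws:///([-0-9a-z]+)/([-0-9a-z]+)$`)

// parseProviderID is a hypothetical helper that splits a provider ID into
// zone and name, or returns an error shaped like the one in the logs.
func parseProviderID(id string) (zone, name string, err error) {
	m := providerIDPattern.FindStringSubmatch(id)
	if m == nil {
		return "", "", fmt.Errorf("wrong id: expected format aws:///<zone>/<name>, got %s", id)
	}
	return m[1], m[2], nil
}

func main() {
	ids := []string{
		// Made-up ID in the usual two-segment EC2 form.
		"aws:///us-west-1a/i-0123456789abcdef0",
		// HyperPod ID from the logs: extra "sagemaker/cluster" path segments.
		"aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0059feb421f5b03ed",
	}
	for _, id := range ids {
		zone, name, err := parseProviderID(id)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("ok: zone=%s name=%s\n", zone, name)
	}
}
```

Under this assumption, the HyperPod ID fails because `sagemaker/cluster/` adds path segments that a `<zone>/<name>` pattern cannot absorb. That matches the symptom in the logs, where a HyperPod node's provider ID is rejected on every loop.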

How to reproduce it (as minimally and precisely as possible):

Set up cluster-autoscaler with auto-discovery. Discovery and autoscaling work well at this point.
Add an AWS SageMaker HyperPod EKS cluster.
This causes the error logs shown above, and autoscaling stops.

Anything else we need to know?:

@Arthurhussey Arthurhussey added the kind/bug Categorizes issue or PR as related to a bug. label Nov 28, 2024
@adrianmoisey
Member

/area cluster-autoscaler
