Autoscaler auto discover fails with sagemaker hyperpod #7540

Arthurhussey · 2024-11-28T22:26:21Z

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 1.30

What k8s version are you using (kubectl version)?:

$ Client Version: v1.30.0
$ Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
$ Server Version: v1.30.6-eks-7f9249a

What environment is this in?:

AWS/EKS

What did you expect to happen?:

The autoscaler pod should scale the ASGs as required, even when a SageMaker HyperPod cluster is attached.

ASG scaling

What happened instead?:

No autoscaling takes place, and I can see these errors in the logs:

E1128 22:19:38.487185       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0059feb421f5b03ed
I1128 22:19:48.487306       1 static_autoscaler.go:306] Starting main loop
E1128 22:19:48.488169       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-00f55e8b5be774bd7
I1128 22:19:58.489198       1 static_autoscaler.go:306] Starting main loop
E1128 22:19:58.490443       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0a42a231e263cac0c
I1128 22:20:08.491624       1 static_autoscaler.go:306] Starting main loop
E1128 22:20:08.492629       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0059feb421f5b03ed
I1128 22:20:18.493453       1 static_autoscaler.go:306] Starting main loop
E1128 22:20:18.494944       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0cfbeeb3654698d80

How to reproduce it (as minimally and precisely as possible):

Set up cluster-autoscaler with auto-discovery. At this point, discovery and autoscaling work well.
Add an AWS SageMaker HyperPod EKS cluster.
The error logs above then appear on every loop, and autoscaling stops.

Anything else we need to know?:

adrianmoisey · 2024-11-29T08:47:55Z

/area cluster-autoscaler

Arthurhussey added the kind/bug Categorizes issue or PR as related to a bug. label Nov 28, 2024

k8s-ci-robot added the area/cluster-autoscaler label Nov 29, 2024
