Dynamic Cluster Distribution bug: cannot get shard number when restarting the app-controller pod while shard-cm already exists #20965

Open
ivan-cai opened this issue Nov 27, 2024 · 4 comments
Labels
bug (Something isn't working), component:application-controller, component:sharding, version:2.13 (Latest confirmed affected version is 2.13)

Comments


ivan-cai commented Nov 27, 2024

Describe the bug

When I enabled the Dynamic Cluster Distribution feature, I found a serious problem: the application-controller would not process applications.

After enabling the Dynamic Cluster Distribution feature, I started 3 application-controller (Deployment) replicas, and the shard configmap argocd-app-controller-shard-cm was created. Then I restarted one replica; the new application-controller pod would not process applications, and applications got stuck in refreshing. No errors were logged.

The root cause is: when I restart one application-controller pod, the newly created pod's default shard number is -1, and the heartbeat of the shard belonging to the old pod in the configmap argocd-app-controller-shard-cm has not timed out yet, so the shard number returned by getOrUpdateShardNumberForController is still -1.
That -1 is then set on clusterSharding.Shard, and with shard -1 no application add/update/delete event is enqueued.
clusterSharding.Shard is also never updated by ReadinessHealthCheck; ReadinessHealthCheck only updates the configmap argocd-app-controller-shard-cm.
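To make the failure mode concrete, here is a minimal Go sketch of the logic as I understand it. The names shardEntry, getOrUpdateShardNumberForController, isManagedCluster and the 60-second timeout are simplified stand-ins for illustration, not the actual Argo CD implementation.

```go
// Minimal sketch of the failure mode. All names and the timeout value are
// simplified stand-ins, not the real Argo CD code.
package main

import (
	"fmt"
	"time"
)

const heartbeatTimeout = 60 * time.Second

// shardEntry mirrors one row of the argocd-app-controller-shard-cm configmap.
type shardEntry struct {
	ControllerName string
	ShardNumber    int
	HeartbeatTime  time.Time
}

// getOrUpdateShardNumberForController returns the shard a new controller pod
// should take over. If every existing heartbeat is still fresh (the old pod
// was just deleted, so its entry has not timed out yet), no shard is free and
// the new pod keeps the default shard -1.
func getOrUpdateShardNumberForController(entries []shardEntry, controller string) int {
	for i, e := range entries {
		if e.ControllerName == controller || time.Since(e.HeartbeatTime) > heartbeatTimeout {
			entries[i].ControllerName = controller
			entries[i].HeartbeatTime = time.Now()
			return e.ShardNumber
		}
	}
	return -1 // nothing reclaimable -> stuck on the default shard
}

// isManagedCluster shows why shard -1 is fatal: it never matches any
// cluster's shard, so no application add/update/delete event is enqueued.
func isManagedCluster(controllerShard, clusterShard int) bool {
	return controllerShard == clusterShard
}

func main() {
	// The restarted pod "ctrl-new" replaces "ctrl-old", whose heartbeat is
	// only a few seconds old and therefore has not expired.
	entries := []shardEntry{
		{ControllerName: "ctrl-old", ShardNumber: 0, HeartbeatTime: time.Now().Add(-5 * time.Second)},
		{ControllerName: "ctrl-1", ShardNumber: 1, HeartbeatTime: time.Now()},
		{ControllerName: "ctrl-2", ShardNumber: 2, HeartbeatTime: time.Now()},
	}

	shard := getOrUpdateShardNumberForController(entries, "ctrl-new")
	fmt.Println("assigned shard:", shard) // -1 until ctrl-old's heartbeat times out

	for clusterShard := 0; clusterShard < 3; clusterShard++ {
		fmt.Printf("cluster shard %d processed by this pod: %v\n",
			clusterShard, isManagedCluster(shard, clusterShard))
	}
}
```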

To Reproduce

  1. Enable the Dynamic Cluster Distribution feature (optional: set the controller log level to debug)
  2. Scale the application-controller deployment up to 3 replicas (> 1)
  3. Rollout-restart the deployment, or delete one pod so that it restarts

Expected behavior

After enabling the Dynamic Cluster Distribution feature, each application-controller shard can process applications normally.

Version

v2.13, v2.12, ...

Logs

I get the following debug logs:

time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.17.223.90:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.57:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.55:6443 with clusterShard 1 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.201.207:6443 with clusterShard 1 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.16.191.172:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.17.223.93:6443 with clusterShard 1 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.57:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.57:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.16.191.172:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.201.211:6443 with clusterShard 2 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.56:6443 with clusterShard 2 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.56:6443 with clusterShard 2 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.57:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.57:6443 with clusterShard 0 should be processed by shard -1"
ivan-cai added the bug label Nov 27, 2024

ivan-cai commented Nov 27, 2024

@alexmt @crenshaw-dev @jessesuen I have found the root cause, but because of the Deployment's RollingUpdate strategy I don't have a solution yet.


ivan-cai commented Dec 3, 2024

Hi @andrii-korotkov-verkada, I have fixed this issue with PR #21042. Can you help me assign some reviewers?

@FredrikAugust
Contributor

I think we've run into this now and are stuck as a result. No replica seems to want to pick up the clusters assigned to shard 0.

@FredrikAugust
Contributor

The workaround for us was to just repeatedly kill the shard-0 replica until it started behaving.
