Dynamic Cluster Distribution bug: cannot get shard number when restarting the app-controller pod while shard-cm already exists #20965

Open
ivan-cai opened this issue Nov 27, 2024 · 4 comments
Labels
bug (Something isn't working), component:application-controller, component:sharding, version:2.13 (Latest confirmed affected version is 2.13)

Comments


ivan-cai commented Nov 27, 2024

Describe the bug

When I enabled the Dynamic Cluster Distribution feature, I found a serious problem: the application-controller would not process applications.

After enabling the Dynamic Cluster Distribution feature, I started 3 application-controller (Deployment) replicas, and the shard configmap argocd-app-controller-shard-cm was created. Then I restarted one replica; the new application-controller pod would not process applications, and applications got stuck in refreshing. No errors were logged.

The root cause is: when I restart one application-controller pod, the newly created pod's default shard number is -1, and the heartbeat of the shard belonging to the old pod in the configmap argocd-app-controller-shard-cm has not timed out yet, so the shard number returned by getOrUpdateShardNumberForController is still -1.
That -1 is then set on clusterSharding.Shard, and with shard -1 no application add/update/delete event is enqueued.
clusterSharding.Shard is also never updated by ReadinessHealthCheck; ReadinessHealthCheck only updates the configmap argocd-app-controller-shard-cm.
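To make the failure mode concrete, here is a minimal Go sketch of the logic as I understand it. The names shardEntry, getOrUpdateShardNumberForController, isManagedCluster and the 60-second timeout are simplified stand-ins for illustration, not the actual Argo CD implementation.

```go
// Minimal sketch of the failure mode. All names and the timeout value are
// simplified stand-ins, not the real Argo CD code.
package main

import (
	"fmt"
	"time"
)

const heartbeatTimeout = 60 * time.Second

// shardEntry mirrors one row of the argocd-app-controller-shard-cm configmap.
type shardEntry struct {
	ControllerName string
	ShardNumber    int
	HeartbeatTime  time.Time
}

// getOrUpdateShardNumberForController returns the shard a new controller pod
// should take over. If every existing heartbeat is still fresh (the old pod
// was just deleted, so its entry has not timed out yet), no shard is free and
// the new pod keeps the default shard -1.
func getOrUpdateShardNumberForController(entries []shardEntry, controller string) int {
	for i, e := range entries {
		if e.ControllerName == controller || time.Since(e.HeartbeatTime) > heartbeatTimeout {
			entries[i].ControllerName = controller
			entries[i].HeartbeatTime = time.Now()
			return e.ShardNumber
		}
	}
	return -1 // nothing reclaimable -> stuck on the default shard
}

// isManagedCluster shows why shard -1 is fatal: it never matches any
// cluster's shard, so no application add/update/delete event is enqueued.
func isManagedCluster(controllerShard, clusterShard int) bool {
	return controllerShard == clusterShard
}

func main() {
	// The restarted pod "ctrl-new" replaces "ctrl-old", whose heartbeat is
	// only a few seconds old and therefore has not expired.
	entries := []shardEntry{
		{ControllerName: "ctrl-old", ShardNumber: 0, HeartbeatTime: time.Now().Add(-5 * time.Second)},
		{ControllerName: "ctrl-1", ShardNumber: 1, HeartbeatTime: time.Now()},
		{ControllerName: "ctrl-2", ShardNumber: 2, HeartbeatTime: time.Now()},
	}

	shard := getOrUpdateShardNumberForController(entries, "ctrl-new")
	fmt.Println("assigned shard:", shard) // -1 until ctrl-old's heartbeat times out

	for clusterShard := 0; clusterShard < 3; clusterShard++ {
		fmt.Printf("cluster shard %d processed by this pod: %v\n",
			clusterShard, isManagedCluster(shard, clusterShard))
	}
}
```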

To Reproduce

  1. Enable the Dynamic Cluster Distribution feature (optional: set the controller log level to debug)
  2. Scale the application-controller deployment up to 3 replicas (> 1)
  3. Rollout-restart the deployment, or delete one pod so that it restarts

Expected behavior

After enabling the Dynamic Cluster Distribution feature, each application-controller shard can process applications normally.

Version

v2.13, v2.12, ...

Logs

I get the following debug logs:

time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.17.223.90:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.57:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.55:6443 with clusterShard 1 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.201.207:6443 with clusterShard 1 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.16.191.172:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.17.223.93:6443 with clusterShard 1 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.57:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.57:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.16.191.172:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.201.211:6443 with clusterShard 2 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.56:6443 with clusterShard 2 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.56:6443 with clusterShard 2 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.57:6443 with clusterShard 0 should be processed by shard -1"
time="2024-11-27T02:27:37Z" level=debug msg="Checking if cluster https://172.18.204.57:6443 with clusterShard 0 should be processed by shard -1"
ivan-cai added the bug label Nov 27, 2024

ivan-cai commented Nov 27, 2024

@alexmt @crenshaw-dev @jessesuen I have found the root cause, but because of the Deployment's RollingUpdate strategy I don't have a solution yet.


ivan-cai commented Dec 3, 2024

Hi @andrii-korotkov-verkada, I have fixed this issue with PR #21042. Can you help me assign some reviewers?

@FredrikAugust
Contributor

I think we've run into this now and are stuck as a result. No replica seems to want to pick up the clusters assigned to shard 0.

@FredrikAugust
Contributor

The workaround for us was to just repeatedly kill the shard-0 replica until it started behaving.
