fix: dynamic cluster distribution issue 20965, update the shard… #21042

ivan-cai · 2024-12-03T11:06:19Z

… number by readiness health check

bugfix for #20965, Dynamic Cluster Distribution

When the shard ConfigMap argocd-app-controller-shard-cm exists and the number of replicas is greater than 1, an app controller pod that restarts because of an image update or a pod eviction ends up with shard number -1. That controller shard then stops processing applications, and the applications get stuck in refreshing.

This change updates the shard number of the application controller (clusterSharding.shard and liveStateCache.clusterSharding.shard) in the readiness health check whenever the shard number has changed, and then resyncs all applications.
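The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `controller` type, its fields, and `readinessCheck` are hypothetical stand-ins, and the slice `queue` only imitates the refresh work queue.

```go
package main

import (
	"fmt"
	"sync"
)

// controller is a hypothetical stand-in for the application controller.
// It is not the Argo CD type and exists only for this sketch.
type controller struct {
	mu    sync.Mutex
	shard int      // shard currently used to filter applications; -1 means unassigned
	apps  []string // keys of all known applications (illustrative)
	queue []string // stand-in for the refresh work queue
}

// readinessCheck receives the shard number currently assigned to this pod
// (for example, as read from the shard ConfigMap). If it differs from the
// cached value, the cache is updated and every application is re-enqueued.
// It reports whether the shard number changed.
func (c *controller) readinessCheck(assigned int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if assigned == c.shard {
		return false // nothing changed, so no resync is needed
	}
	c.shard = assigned
	c.queue = append(c.queue, c.apps...) // resync all applications
	return true
}

func main() {
	c := &controller{shard: -1, apps: []string{"argocd/guestbook", "argocd/helm-app"}}
	fmt.Println(c.readinessCheck(0), c.shard, len(c.queue)) // shard changes from -1 to 0
	fmt.Println(c.readinessCheck(0), c.shard, len(c.queue)) // second probe is a no-op
}
```

Because a readiness probe runs periodically, a check of this shape only triggers a resync on the probe where the shard number actually differs from the cached one.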

codecov · 2024-12-03T12:56:58Z

Codecov Report

Attention: Patch coverage is 0% with 46 lines in your changes missing coverage. Please review.

Project coverage is 55.16%. Comparing base (f429352) to head (66121a8).

Files with missing lines	Patch %	Lines
controller/sharding/cache.go	0.00%	22 Missing ⚠️
controller/appcontroller.go	0.00%	19 Missing ⚠️
controller/sharding/sharding.go	0.00%	2 Missing and 1 partial ⚠️
controller/cache/cache.go	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #21042      +/-   ##
==========================================
- Coverage   55.19%   55.16%   -0.04%     
==========================================
  Files         337      337              
  Lines       57055    57099      +44     
==========================================
+ Hits        31492    31497       +5     
- Misses      22870    22913      +43     
+ Partials     2693     2689       -4

andrii-korotkov-verkada · 2024-12-03T14:16:55Z

controller/appcontroller.go

+						key, err := cache.MetaNamespaceKeyFunc(app)
+						if err == nil {
+							ctrl.appRefreshQueue.AddRateLimited(key)
+							ctrl.appOperationQueue.AddRateLimited(key)


No need to add this, it would be added by refresh processing code after refresh is completed.

This resync (enqueue) code is necessary. Otherwise, even if the shard number is changed, the application controller will not process the applications immediately, but will wait for the default resync time.

Would refresh not be triggered immediately?

With this resync code, it will work, otherwise it will not.

The code changed in 2.13: refresh now enqueues the app operation once it is done. This prevents some race conditions and makes sure the sync is done with a fresh version of the data.

So basically we need two resyncs, one initially and one after the refresh?

the application controller will not process the applications immediately, but will wait for the default resync time.

Why would that happen?

So basically we need two resyncs, one initially and one after the refresh?

Yes, I have seen the changes, I will delete this line ctrl.appOperationQueue.AddRateLimited(key).
Thank you for your review!

Ah okay. Good to clarify. We have one case where we may enqueue app operation immediately and another after some delay.

andrii-korotkov-verkada · 2024-12-03T14:17:22Z

controller/appcontroller.go

+					}
+					for _, app := range apps {
+						if !ctrl.canProcessApp(app) {
+							return nil


Did you mean to continue instead?

It is the same as the AddFunc code: it only enqueues the applications that this shard is responsible for.

But I mean you'd return from the function in the middle of the loop going through apps, so some apps may be left behind even if ctrl can process them.

Yes, you are right.
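The difference between `return` and `continue` in this loop can be shown with a small sketch. The function names and the `canProcess` predicate below are hypothetical; they only imitate the shape of the loop under review.

```go
package main

import "fmt"

// enqueueWithReturn mimics the original loop: the first app this shard
// cannot process ends the whole function, so later apps are never enqueued.
func enqueueWithReturn(apps []string, canProcess func(string) bool) []string {
	var queued []string
	for _, app := range apps {
		if !canProcess(app) {
			return queued
		}
		queued = append(queued, app)
	}
	return queued
}

// enqueueWithContinue skips only the apps this shard cannot process and
// keeps iterating over the rest.
func enqueueWithContinue(apps []string, canProcess func(string) bool) []string {
	var queued []string
	for _, app := range apps {
		if !canProcess(app) {
			continue
		}
		queued = append(queued, app)
	}
	return queued
}

func main() {
	apps := []string{"a", "b", "c"}
	// Pretend app "b" belongs to another shard.
	canProcess := func(app string) bool { return app != "b" }

	fmt.Println(enqueueWithReturn(apps, canProcess))   // prints [a]
	fmt.Println(enqueueWithContinue(apps, canProcess)) // prints [a c]
}
```

With `return`, app "c" is left behind even though this shard could process it, which is the problem pointed out in the review.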

ivan-cai · 2024-12-04T02:19:00Z

controller/appcontroller.go

+						if !ctrl.canProcessApp(app) {
+							continue
+						}
+						key, err := cache.MetaNamespaceKeyFunc(app)
+						if err == nil {
+							ctrl.appRefreshQueue.AddRateLimited(key)
+						}
+						if err == nil {


@andrii-korotkov-verkada I have fixed the two problems you mentioned.

andrii-korotkov-verkada · 2024-12-04T02:19:20Z

controller/appcontroller.go

+						}
+						if err == nil {
+							ctrl.clusterSharding.AddApp(app)
+						}


Combine the two blocks, since they check for the same condition. Or, do continue if err != nil. We also probably wanna log the error.
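One way the suggestion could look is sketched below, assuming a hypothetical `keyFor` helper in place of cache.MetaNamespaceKeyFunc and plain slices in place of the refresh queue and the sharding cache. It is an illustration of the pattern, not the code that was merged.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// app is a hypothetical, simplified application record.
type app struct{ namespace, name string }

// keyFor is a hypothetical stand-in for cache.MetaNamespaceKeyFunc.
func keyFor(a app) (string, error) {
	if a.name == "" {
		return "", errors.New("app has no name")
	}
	return a.namespace + "/" + a.name, nil
}

// enqueueAll checks the error once per app: on failure it logs and moves on,
// otherwise it both enqueues the key and tracks the app.
func enqueueAll(apps []app) (queued []string, tracked []app) {
	for _, a := range apps {
		key, err := keyFor(a)
		if err != nil {
			log.Printf("skipping app in namespace %q: %v", a.namespace, err)
			continue
		}
		queued = append(queued, key) // stand-in for appRefreshQueue.AddRateLimited(key)
		tracked = append(tracked, a) // stand-in for clusterSharding.AddApp(app)
	}
	return queued, tracked
}

func main() {
	queued, tracked := enqueueAll([]app{{"argocd", "guestbook"}, {"argocd", ""}})
	fmt.Println(queued, len(tracked))
}
```

Using `continue` on error keeps the success path unindented and guarantees that the two follow-up actions always happen together.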

andrii-korotkov-verkada · 2024-12-04T02:28:24Z

Approved. Btw, consider using new commits instead of force-pushing; that helps reviewers see which changes you are making and tells a story. The commits are squashed when merged.

ivan-cai · 2024-12-04T03:34:13Z

/retest

ivan-cai requested a review from a team as a code owner December 3, 2024 11:06

ivan-cai changed the title ~~bugfix for dynamic cluster distribution issue 20965, update the shard…~~ fix: dynamic cluster distribution issue 20965, update the shard… Dec 3, 2024

ivan-cai force-pushed the bugfix_sharding_dynamic_cluster_distribution branch 6 times, most recently from 6698dc3 to c93c7b3 Compare December 3, 2024 12:14

ivan-cai mentioned this pull request Dec 3, 2024

Dynamic Cluster Distribution bug：can not get shard number while restart app-controller pod and shard-cm is existed #20965

Open

andrii-korotkov-verkada reviewed Dec 3, 2024

ivan-cai force-pushed the bugfix_sharding_dynamic_cluster_distribution branch from c93c7b3 to 98f58c1 Compare December 4, 2024 02:16

ivan-cai commented Dec 4, 2024

andrii-korotkov-verkada reviewed Dec 4, 2024

ivan-cai force-pushed the bugfix_sharding_dynamic_cluster_distribution branch from 98f58c1 to a31de81 Compare December 4, 2024 02:24

andrii-korotkov-verkada approved these changes Dec 4, 2024

andrii-korotkov-verkada added the ready-for-review label Dec 4, 2024

fix: dynamic cluster distribution issue 20965, update the shard numbe…

66121a8

…r by readiness health check Signed-off-by: caijing <[email protected]>

ivan-cai force-pushed the bugfix_sharding_dynamic_cluster_distribution branch from e32bc65 to 66121a8 Compare December 24, 2024 06:31
