-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: dynamic cluster distribution issue 20965, update the shard… #21042
base: master
Are you sure you want to change the base?
fix: dynamic cluster distribution issue 20965, update the shard… #21042
Conversation
🔴 Preview Environment stopped on BunnyshellSee: Environment Details | Pipeline Logs Available commands (reply to this comment):
|
6698dc3
to
c93c7b3
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #21042 +/- ##
==========================================
- Coverage 55.19% 55.16% -0.04%
==========================================
Files 337 337
Lines 57055 57099 +44
==========================================
+ Hits 31492 31497 +5
- Misses 22870 22913 +43
+ Partials 2693 2689 -4 ☔ View full report in Codecov by Sentry. |
controller/appcontroller.go
Outdated
key, err := cache.MetaNamespaceKeyFunc(app) | ||
if err == nil { | ||
ctrl.appRefreshQueue.AddRateLimited(key) | ||
ctrl.appOperationQueue.AddRateLimited(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to add this, it would be added by refresh processing code after refresh is completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This resync(enqueue) code is necessary. Otherwise, even if the shard number is changed, the application controller will not process the applications immediately, but will wait for the default resync time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would refresh not be triggered immediately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this resync code, it will work, otherwise it will not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code has changes in 2.13, where refresh enqueues app operation after being done. This prevents some race conditions and makes sure the sync is done with a fresh version of data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically we need two resyncs, one initially and one after the refresh?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the application controller will not process the applications immediately, but will wait for the default resync time.
Why would that happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically we need two resyncs, one initially and one after the refresh?
Yes, I have seen the changes, I will delete this line ctrl.appOperationQueue.AddRateLimited(key)
.
Thank you for your review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay. Good to clarify. We have one case where we may enqueue app operation immediately and another after some delay.
controller/appcontroller.go
Outdated
} | ||
for _, app := range apps { | ||
if !ctrl.canProcessApp(app) { | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to continue instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to continue instead?
It is the same as the AddFunc
code, only enqueue the applications that the shard is responsible for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I mean you'd return from the function in the middle of the loop going through apps, so some apps may be left behind even if ctrl can process them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right.
c93c7b3
to
98f58c1
Compare
controller/appcontroller.go
Outdated
if !ctrl.canProcessApp(app) { | ||
continue | ||
} | ||
key, err := cache.MetaNamespaceKeyFunc(app) | ||
if err == nil { | ||
ctrl.appRefreshQueue.AddRateLimited(key) | ||
} | ||
if err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrii-korotkov-verkada I have fixed the two problems you mentioned.
} | ||
if err == nil { | ||
ctrl.clusterSharding.AddApp(app) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combine the two blocks, since they check for the same condition. Or, do continue if err != nil. We also probably wanna log the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combined!
98f58c1
to
a31de81
Compare
Approved. Btw, consider using new commits instead of force pushing, that can help to see which changes you are making and tell some story. They are squashed when merged. |
/retest |
…r by readiness health check Signed-off-by: caijing <[email protected]>
e32bc65
to
66121a8
Compare
… number by readiness health check
bugfix for #20965, Dynamic Cluster Distribution
When the shard configmap
argocd-app-controller-shard-cm
exists and the number of replicas is greater than 1, the shard number of the app controller pod restarted due to image update or pod eviction will be -1. Then, the app controller shard will not process applications, and applications stuck in refreshing.We try to update the shard number of the application controller(
clusterSharding.shard
andliveStateCache.clusterSharding.shard
) in the readiness health check if the shard number have changed, and the resync all applications.