Fix the issue of missing workqueue metrics in the Karmada controller #5945

Open
wants to merge 1 commit into
base: master

chaosi-zju
Member

@chaosi-zju chaosi-zju commented Dec 13, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Remove the dependency on the k8s.io/component-base/metrics/prometheus/workqueue package to fix the missing workqueue metrics in the Karmada controller.

Which issue(s) this PR fixes:

Fixes #5696

Special notes for your reviewer:

  • The number of exposed metric lines was previously 900+ and is now 2400+:
/ # curl http://127.0.0.1:8080/metrics | grep -v '#' | wc -l
2454

Does this PR introduce a user-facing change?:

`karmada-controller-manager`: fix the issue of missing workqueue metrics in the Karmada controller

@karmada-bot karmada-bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 13, 2024
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 13, 2024
@codecov-commenter

codecov-commenter commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 48.17%. Comparing base (471d850) to head (46794dc).
Report is 2 commits behind head on master.

Files with missing lines | Patch % | Lines
...sourceinterpreter/customized/webhook/customized.go | 0.00% | 2 Missing ⚠️
pkg/karmadactl/promote/promote.go | 0.00% | 1 Missing ⚠️
pkg/resourceinterpreter/interpreter.go | 0.00% | 1 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5945      +/-   ##
==========================================
+ Coverage   48.15%   48.17%   +0.01%     
==========================================
  Files         664      664              
  Lines       54803    54798       -5     
==========================================
+ Hits        26393    26400       +7     
+ Misses      26693    26683      -10     
+ Partials     1717     1715       -2     
Flag | Coverage Δ
unittests | 48.17% <0.00%> (+0.01%) ⬆️

@chaosi-zju
Member Author

chaosi-zju commented Dec 13, 2024

1. Why the metrics are missing

In the Karmada controller, workqueue metrics are registered in two places: sigs.k8s.io/controller-runtime/pkg/metrics and k8s.io/component-base/metrics/prometheus/workqueue.

We never import both packages on purpose. We only intend to use sigs.k8s.io/controller-runtime/pkg/metrics, while k8s.io/component-base/metrics/prometheus/workqueue is pulled in through a deep (transitive) dependency.

Both packages register workqueue metrics in their init functions, but the workqueue metrics provider can only be set once, so the order in which the two packages are initialized becomes very important.

You can check the current order with go build --ldflags=--dumpdep github.com/karmada-io/karmada/cmd/controller-manager 2>&1 | grep inittask and see that k8s.io/component-base/metrics/prometheus/workqueue is initialized earlier. As a result, an unexpected workqueue provider is set, which causes the workqueue metrics to go missing.

// init in k8s.io/component-base/metrics/prometheus/workqueue:
// registers its metrics with legacyregistry and sets its own provider.
func init() {
	for _, m := range metrics {
		legacyregistry.MustRegister(m)
	}
	workqueue.SetProvider(prometheusMetricsProvider{})
}

// init in sigs.k8s.io/controller-runtime/pkg/metrics:
// registers its metrics with the controller-runtime Registry and sets its own provider.
func init() {
	Registry.MustRegister(depth)
	Registry.MustRegister(adds)
	Registry.MustRegister(latency)
	Registry.MustRegister(workDuration)
	Registry.MustRegister(unfinished)
	Registry.MustRegister(longestRunningProcessor)
	Registry.MustRegister(retries)
	workqueue.SetProvider(workqueueMetricsProvider{})
}

// SetProvider sets the metrics provider for all subsequently created work
// queues. Only the first call has an effect.
func SetProvider(metricsProvider MetricsProvider) {
	globalMetricsFactory.setProvider(metricsProvider)
}

// setProvider is guarded by sync.Once, so every call after the first is a no-op.
func (f *queueMetricsFactory) setProvider(mp MetricsProvider) {
	f.onlyOnce.Do(func() {
		f.metricsProvider = mp
	})
}
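
To make the "only the first call has an effect" behavior concrete, here is a minimal, self-contained sketch. The provider and factory types are hypothetical stand-ins for illustration, not the real client-go types; only the sync.Once guard mirrors the snippet above.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a hypothetical stand-in for workqueue.MetricsProvider.
type provider string

// factory is a hypothetical stand-in for queueMetricsFactory:
// it keeps only the first provider it is given.
type factory struct {
	once     sync.Once
	provider provider
}

// setProvider stores p only on the first call; later calls are ignored.
func (f *factory) setProvider(p provider) {
	f.once.Do(func() { f.provider = p })
}

// simulate calls setProvider in the given order, the way two init
// functions would, and returns the provider that ends up in effect.
func simulate(order ...provider) provider {
	var f factory
	for _, p := range order {
		f.setProvider(p)
	}
	return f.provider
}

func main() {
	// component-base initialized first: its provider wins.
	fmt.Println(simulate("component-base", "controller-runtime")) // prints "component-base"
	// controller-runtime initialized first: its provider wins.
	fmt.Println(simulate("controller-runtime", "component-base")) // prints "controller-runtime"
}
```

Whichever package initializes first decides the provider for every workqueue created afterwards, which is why initialization order alone can make the expected metrics disappear.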

2. When the problem was introduced

Karmada v1.9.9 was still built with Go 1.20, and the Karmada controller did not miss workqueue metrics:

➜  karmada git:(release-1.9) ✗ kh exec -it `kh get pods -n karmada-system | grep $comp | sed -n $i'p' | awk '{print $1}'` -n karmada-system -- sh
/ # curl http://127.0.0.1:8080/metrics | grep workqueue_work_duration_seconds_sum
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 95079    0 95079    0     0  14.2M      0 --:--:-- --:--:-- --:--:-- 15.1M
workqueue_work_duration_seconds_sum{name="binding-eviction"} 0
workqueue_work_duration_seconds_sum{name="cluster"} 1.9943413949999997
workqueue_work_duration_seconds_sum{name="cluster-binding-eviction"} 0
workqueue_work_duration_seconds_sum{name="clusterPropagationPolicy reconciler"} 0
workqueue_work_duration_seconds_sum{name="clusterResourceBinding_status_controller"} 0
workqueue_work_duration_seconds_sum{name="clusterresourcebinding"} 0
workqueue_work_duration_seconds_sum{name="cronfederatedhpa"} 0
workqueue_work_duration_seconds_sum{name="dependencies resource detector"} 0
workqueue_work_duration_seconds_sum{name="endpointslice-collect"} 0
workqueue_work_duration_seconds_sum{name="federatedhpa"} 0
workqueue_work_duration_seconds_sum{name="federatedresourcequota"} 0
workqueue_work_duration_seconds_sum{name="hpa-replicas-syncer"} 0
workqueue_work_duration_seconds_sum{name="multiclusterservice"} 0
workqueue_work_duration_seconds_sum{name="namespace"} 0.005409679
workqueue_work_duration_seconds_sum{name="propagationPolicy reconciler"} 0
workqueue_work_duration_seconds_sum{name="remedy-controller"} 0.394528854
workqueue_work_duration_seconds_sum{name="resource detector"} 0.35019135200000034
workqueue_work_duration_seconds_sum{name="resourceBinding_status_controller"} 0
workqueue_work_duration_seconds_sum{name="resourcebinding"} 0
workqueue_work_duration_seconds_sum{name="scale ref worker"} 0
workqueue_work_duration_seconds_sum{name="service-export"} 0
workqueue_work_duration_seconds_sum{name="serviceimport"} 0
workqueue_work_duration_seconds_sum{name="work"} 0.9787631499999999
workqueue_work_duration_seconds_sum{name="work-status"} 0.32815969899999997

Starting with Karmada v1.10.0, which is built with Go 1.21, the problem appears and the workqueue metrics are missing:

➜  karmada git:(release-1.10) ✗ kh exec -it `kh get pods -n karmada-system | grep $comp | sed -n $i'p' | awk '{print $1}'` -n karmada-system -- sh
/ # curl http://127.0.0.1:8080/metrics | grep workqueue_work_duration_seconds_sum
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6474    0  6474    0     0  1794k      0 --:--:-- --:--:-- --:--:-- 2107k
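
For a check that does not depend on reading terminal output by eye, the comparison above can be approximated with a small helper. This is only a sketch: countSeries and sampleDump are hypothetical names, and the sample text reuses two lines from the v1.9 output plus an illustrative comment line.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countSeries counts the sample lines in a Prometheus text dump that
// belong to the metric called name. Comment lines ("# ...") are skipped.
func countSeries(dump, name string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// A sample line is either name{labels} value or name value.
		if strings.HasPrefix(line, name+"{") || strings.HasPrefix(line, name+" ") {
			n++
		}
	}
	return n
}

// sampleDump returns a tiny, hand-written excerpt used only for illustration.
func sampleDump() string {
	return `# illustrative comment line
workqueue_work_duration_seconds_sum{name="cluster"} 1.9943413949999997
workqueue_work_duration_seconds_sum{name="namespace"} 0.005409679
`
}

func main() {
	fmt.Println(countSeries(sampleDump(), "workqueue_work_duration_seconds_sum")) // prints 2
}
```

A result of 0 for a dump taken from the controller would correspond to the empty grep output shown for release-1.10.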

3. Why older versions were fine

With Go 1.20, we deliberately imported the package "sigs.k8s.io/controller-runtime/pkg/metrics" first to ensure that its init function was called earlier. Refer to:

However, Go 1.21 changed this behavior. From the Go 1.21 release notes:

Package initialization order is now specified more precisely. The new algorithm is:

  • Sort all packages by import path.
  • Repeat until the list of packages is empty:
      • Find the first package in the list for which all imports are already initialized.
      • Initialize that package and remove it from the list.

We can no longer control the order through our import statements: because its import path sorts before sigs.k8s.io/controller-runtime/pkg/metrics, the init function of k8s.io/component-base/metrics/prometheus/workqueue will always run earlier.
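
The algorithm quoted from the release notes can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the Go toolchain's implementation: initOrder and exampleGraph are hypothetical helpers, example.com/app is a made-up main package, and the import graph is heavily simplified.

```go
package main

import (
	"fmt"
	"sort"
)

// initOrder applies the Go 1.21 rule to a map of package -> direct imports:
// sort by import path, then repeatedly initialize the first package whose
// imports are all already initialized.
func initOrder(imports map[string][]string) []string {
	pending := make([]string, 0, len(imports))
	for pkg := range imports {
		pending = append(pending, pkg)
	}
	sort.Strings(pending)

	done := make(map[string]bool, len(imports))
	var order []string
	for len(pending) > 0 {
		picked := -1
		for i, pkg := range pending {
			ready := true
			for _, dep := range imports[pkg] {
				if !done[dep] {
					ready = false
					break
				}
			}
			if ready {
				picked = i
				break
			}
		}
		if picked < 0 {
			break // import cycle or a dependency missing from the map
		}
		pkg := pending[picked]
		done[pkg] = true
		order = append(order, pkg)
		pending = append(pending[:picked], pending[picked+1:]...)
	}
	return order
}

// exampleGraph is a simplified, hypothetical import graph.
func exampleGraph() map[string][]string {
	const (
		wq   = "k8s.io/client-go/util/workqueue"
		comp = "k8s.io/component-base/metrics/prometheus/workqueue"
		ctrl = "sigs.k8s.io/controller-runtime/pkg/metrics"
	)
	return map[string][]string{
		wq:                nil,
		comp:              {wq},
		ctrl:              {wq},
		"example.com/app": {ctrl, comp}, // source import order is irrelevant
	}
}

func main() {
	for _, pkg := range initOrder(exampleGraph()) {
		fmt.Println(pkg)
	}
}
```

In this simplified graph the component-base package is initialized before the controller-runtime one even though the hypothetical main package lists controller-runtime first, because "k8s.io/..." sorts before "sigs.k8s.io/...".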

@@ -70,7 +68,7 @@ func NewCustomizedInterpreter(informer genericmanager.SingleClusterInformerManag
 	}
 
 	cm.SetAuthenticationInfoResolver(authInfoResolver)
-	cm.SetServiceResolver(apiserver.NewClusterIPServiceResolver(serviceLister))
+	cm.SetServiceResolver(webhookutil.NewDefaultServiceResolver())
Member

Actually, before #2999 we used exactly webhookutil.NewDefaultServiceResolver(). We need to figure out why it was changed to apiserver.NewClusterIPServiceResolver(serviceLister).

Member

/hold
This essentially reverts #2999 and breaks the use case of running Karmada without CoreDNS, which was reported by @lxtywypc.

@karmada-bot karmada-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 14, 2024
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

karmada controller can't find metrics workqueue_depth
5 participants