cloud_controller cpu and memory use increases over time, even with no traffic. #262
Comments
@metskem @pusherofbrooms are you seeing the usage increase on the API VMs or workers, or both? @philippthun are you seeing the usage increase on just the workers, or also on the API VMs?
We see it only on the API VMs.
The API instances are where we see CPU and memory leakage, specifically in the cloud_controller_ng process.
We (when running CAPI 1.132.0 with Ruby 3.0) are only seeing this on the cc-workers. We did some local tests (now with Ruby 3.1) and profiled the worker process.
As there are no jobs, the worker is sleeping most of the time. But besides that, it's constantly reloading the ccng routes!? The worker does not need the routes at all, right? I also used heap-profiler on the bigger environment (also for 5 minutes) and heapy to analyze the retained heap dump.
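For context on the tooling named here, below is a minimal sketch of how such a retained heap dump can be captured and inspected with the heap-profiler and heapy gems; the directory and the profiled block are illustrative placeholders, not the actual worker code.

```ruby
# Minimal sketch, assuming the heap-profiler and heapy gems mentioned above;
# the directory and the profiled block are placeholders, not worker code.
require 'heap-profiler'

HeapProfiler.report('tmp/worker-heap') do
  # let the code path under observation run for a while, e.g. ~5 minutes
  sleep 300
end

# heap-profiler writes allocated/retained dumps into the directory, which can
# then be analyzed from the shell, e.g.:
#   heap-profiler tmp/worker-heap              # summary report
#   heapy read tmp/worker-heap/retained.heap   # group retained objects by GC generation
```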
So the retained memory seems to be related to this (unnecessary) route reloading. To stop this constant reloading, one can set `RAILS_ENV` to `production`.
Although this could mitigate the problem, it still does not explain why memory is increasing since Ruby 3 (route reloading was also done before). And it is also not applicable to the API processes where others are seeing increasing memory consumption.
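As background on why `RAILS_ENV` matters for this, the snippet below shows generic Rails production defaults rather than anything from cloud_controller_ng: in production mode the framework caches application code and routes instead of watching files and reloading them.

```ruby
# Generic Rails behaviour, shown only for illustration (not ccng code):
# with RAILS_ENV=production these are the defaults, so config/routes.rb is
# loaded once at boot instead of being checked and reloaded over and over.
Rails.application.configure do
  config.cache_classes = true  # keep application code loaded; no reloading
  config.eager_load    = true  # load the whole app once at boot
end
```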
We also noticed the workers reloading due to `RAILS_ENV` not being set to `production`.
We're still looking at this, but we've noticed a discrepancy between the memory reported by heap-profiler and the memory usage reported for the process.
`RAILS_ENV` should be set to `production` to prevent Rails/Active Support from reloading routes. In the context of a cc worker this feature is not needed because there are no routes. Related to #262
Regarding the issue on the API nodes: do we know if the issue is occurring on
Hi, we on the @cloudfoundry/cf-autoscaler team are seeing this happen very quickly in our acceptance test environment.
greetings @johha
So the workers seem to be the top memory users, but the cloud_controller_ng process seems to be the top CPU user.
@moleske No, we did not look at that.
@pusherofbrooms Okay, so the memory issue might go away with that change. To understand what the ccng process is doing, could you run a profiler against it? Maybe you can also change the duration to e.g. 300 seconds to get some more data.
@philippthun what we were observing (on the API process) was that heap-profiler would report around 225MB of objects in its memory, but monit status would report much higher memory usage for the process.
@sethboyles We didn't really look at the absolute values reported by heap-profiler. We've set the correct `RAILS_ENV`.
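One way to see that kind of gap, shown purely as an illustration rather than anything from this investigation, is to compare what the Ruby object space accounts for with the RSS the operating system reports for the same process:

```ruby
# Illustration only: Ruby-tracked object memory vs. process RSS. Heap
# fragmentation and memory malloc'd outside the Ruby heap (e.g. by C
# extensions) can make RSS much larger than the sum of live Ruby objects.
require 'objspace'

ruby_bytes = ObjectSpace.memsize_of_all  # bytes attributed to live Ruby objects
rss_kib    = File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+)/, 1].to_i  # Linux only

puts format('Ruby objects: %.1f MiB', ruby_bytes / 1024.0 / 1024.0)
puts format('Process RSS:  %.1f MiB', rss_kib / 1024.0)
```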
@philippthun we find it curious that you all are not experiencing issues with the API process like others are; is that correct? Do you think it might be related to how you all have the following configured? (capi-release/jobs/cloud_controller_ng/spec, lines 1226 to 1228 in 9007cb2)
We are thinking that maybe there is an issue with yajl, which is what CCNG uses by default, and in our heap profiling we did see yajl taking up a lot of memory. We also found a few issues related to yajl's performance: https://bugs.ruby-lang.org/issues/18511 and brianmario/yajl-ruby#221. There's nothing definitive here yet linking yajl to the specific issues we are seeing, but since that is something that may be different on your foundations, it might be worthwhile to look into.
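For reference, a minimal sketch of the two JSON backends being discussed; the payload is a made-up stand-in, not a real CC response body.

```ruby
# Illustrative comparison of the yajl-ruby and Oj encoders mentioned above;
# the payload is fabricated for the example.
require 'yajl'
require 'oj'

payload = {
  'resources' => Array.new(1_000) { |i| { 'guid' => "asg-#{i}", 'rules' => [] } }
}

yajl_json = Yajl::Encoder.encode(payload)    # yajl: what CCNG uses by default
oj_json   = Oj.dump(payload, mode: :compat)  # Oj in JSON-compatible mode

puts "yajl: #{yajl_json.bytesize} bytes, oj: #{oj_json.bytesize} bytes"
```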
Side note: with the "RAILS_ENV fix", the CPU usage on the cc-workers seems to drop significantly.
We believe we may have found the source of a memory leak, although it is one we knew about before we upgraded to Ruby 3.0. By creating a large number of ASGs (~10000), each with 100 rules, we were able to trigger large memory leaks. Using heapy, we were able to see several generations, each with only 1 object, yet large memory usage.
Examining one of the generations, we saw that it was indeed our large payload from the ASG endpoint (the dump also spat out the entire payload). We conducted this test with the Oj encoder as well, and we were skeptical that both Oj and Yajl had memory leaks. We decided to swap out the calls in the JSON encoder, and after making the swap we no longer saw the generations stuck with a single huge object. This still needs some testing, but might be promising.
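As an illustration of the kind of swap described, the sketch below funnels all encode calls through a single wrapper so the backend can be changed in one place; the module name and layout are hypothetical, not CC's actual encoder.

```ruby
# Hypothetical wrapper, not cloud_controller_ng's real encoder class.
require 'oj'
require 'json'

module AppJson
  def self.encode(object)
    Oj.dump(object, mode: :compat)
    # Alternatives that could be swapped in here:
    #   Yajl::Encoder.encode(object)  # the current CCNG default
    #   JSON.generate(object)         # stdlib, to rule out both native extensions
  end
end

AppJson.encode('rules' => [{ 'protocol' => 'all', 'destination' => '0.0.0.0/0' }])
```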
We created a draft PR to play around with removing yajl.
Hey,
We've cut a release containing the changes to address these memory/CPU issues: https://github.com/cloudfoundry/capi-release/releases/tag/1.136.0
FYI, with the latest cf-deployment (v21.9.0), which includes the capi-release with the fix, this resource leak appears to be resolved. CPU and memory use don't seem to be increasing overall.
Great to hear; thanks for the feedback!
Closing. Please let us know if you observe any more issues!
Issue
Over time, cloud controller memory and CPU use increase, seemingly without bound.
Context
In our test environments, one which receives almost no traffic and one which receives a lot of traffic (mostly metrics requests), we see a slow increase in memory and CPU use. A bit of output from sar may show the scale: `sar -r` for memory and `sar` for CPU. This is on an m5.large.
Steps to Reproduce
Deploy with CAPI v1.134.0
Wait.
Restarting cloud controller brings CPU and memory use down to expected levels.
Our cloud_controller config is pretty vanilla, but let us know if any particular settings would be of interest for this case.
Others have mentioned in Slack that this seems to occur with the bump from v1.133.0 to v1.134.0, as referenced here: https://cloudfoundry.slack.com/archives/C07C04W4Q/p1660123878729779?thread_ts=1660062711.781649&cid=C07C04W4Q
Expected result
No increase in memory or CPU use over time, except as related to load.
Current result
CPU and memory use increase over time, even with no load.
Possible Fix
I don't have enough knowledge about cloud controller to speculate intelligently, but I will vaguely point my uneducated finger at the bump in Ruby version.