
cloud_controller cpu and memory use increases over time, even with no traffic. #262

Closed
pusherofbrooms opened this issue Aug 11, 2022 · 24 comments


@pusherofbrooms

pusherofbrooms commented Aug 11, 2022

Issue

Over time, cloud controller memory and CPU use increase, seemingly without bound.

Context

In our test environments, one of which receives almost no traffic and one of which receives a lot of traffic (mostly metrics requests), we see a slow increase in memory and CPU use. A bit of output from sar shows the scale.
From sar -r, for memory:

03:55:01 PM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
04:05:01 PM   1634524   2406676   2173960     57.08    275388    709732   3737704     49.07   1528388    467708       552
04:15:01 PM   1577284   2402476   2231200     58.58    276124    761408   3741344     49.12   1533364    519300      1004
04:25:01 PM   1635484   2397460   2173000     57.06    276868    698220   3750232     49.24   1538444    455988       384
04:35:01 PM   1598960   2344072   2209524     58.02    277540    680620   3798332     49.87   1594148    438312      1556
04:45:01 PM   1537672   2336520   2270812     59.63    278308    733124   3799872     49.89   1601708    490708      1364
04:55:01 PM   1595524   2331660   2212960     58.11    279020    670296   3814796     50.08   1607592    428032       380
05:05:01 PM   1549824   2323212   2258660     59.31    279704    706404   3833476     50.33   1616504    464008       736
05:15:01 PM   1486468   2312928   2322016     60.97    280424    758192   3839392     50.41   1627168    515712      1216
05:25:01 PM   1543828   2307492   2264656     59.46    281552    695036   3848004     50.52   1633660    452320      1856
05:35:01 PM   1499372   2300112   2309112     60.63    282324    730776   3847676     50.52   1642612    487992      1016
05:45:01 PM   1444632   2298728   2363852     62.07    283100    782524   3862244     50.71   1643200    539924      1032
05:55:01 PM   1488228   2219080   2320256     60.92    283916    659628   3928480     51.58   1726536    416936       968
06:05:01 PM   1449308   2217912   2359176     61.95    284728    696120   3933424     51.64   1727404    453332       932
06:15:01 PM   1384260   2206304   2424224     63.65    285496    748344   3961340     52.01   1737776    505472      1992
06:25:01 PM   1442784   2201244   2365700     62.12    286312    684820   3970848     52.13   1747360    441856       916
06:35:01 PM   1397116   2191776   2411368     63.32    287108    719772   3961452     52.01   1755520    476692       112
06:45:01 PM   1339620   2187352   2468864     64.83    287880    771536   3966168     52.07   1759948    528316      2248
06:55:01 PM   1398744   2183488   2409740     63.27    288608    708516   3975748     52.20   1766504    465252      2656
07:05:02 PM   1422292   2184644   2386192     62.65    289372    685116   3973960     52.17   1766040    441744      1140
07:15:01 PM   1368528   2183236   2439956     64.07    290180    736240   3978912     52.24   1766036    492748      1956
Average:       970794   1779595   2837690     74.51    303645    716346   4342606     57.01   2157278    502458      1280

And from sar, for CPU:

03:55:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
04:05:01 PM     all     32.06      0.02      4.56      0.02      0.02     63.33
04:15:01 PM     all     32.68      0.00      4.62      0.02      0.02     62.66
04:25:01 PM     all     33.22      0.11      4.67      0.03      0.02     61.95
04:35:01 PM     all     33.37      0.09      4.66      0.03      0.02     61.83
04:45:01 PM     all     32.90      0.00      4.71      0.02      0.02     62.34
04:55:01 PM     all     33.52      0.10      4.75      0.02      0.02     61.58
05:05:01 PM     all     33.17      0.02      4.62      0.02      0.02     62.15
05:15:01 PM     all     33.51      0.00      4.69      0.02      0.02     61.76
05:25:01 PM     all     33.79      0.11      4.68      0.03      0.02     61.37
05:35:01 PM     all     34.38      0.02      4.66      0.02      0.02     60.91
05:45:01 PM     all     34.20      0.00      4.65      0.02      0.02     61.11
05:55:01 PM     all     34.28      0.19      4.66      0.02      0.02     60.83
06:05:01 PM     all     33.65      0.02      4.56      0.02      0.02     61.73
06:15:01 PM     all     34.11      0.00      4.60      0.02      0.02     61.26
06:25:01 PM     all     34.17      0.10      4.71      0.02      0.02     60.98
06:35:01 PM     all     35.02      0.02      4.76      0.02      0.02     60.17
06:45:01 PM     all     34.66      0.00      4.65      0.02      0.02     60.65
06:55:01 PM     all     35.05      0.10      4.74      0.03      0.02     60.07
07:05:02 PM     all     35.18      0.09      4.64      0.03      0.02     60.04
07:15:01 PM     all     35.51      0.00      4.65      0.02      0.02     59.80
Average:        all     37.76      0.05      4.58      0.02      0.02     57.57

This is on an m5.large.


Steps to Reproduce

Deploy with CAPI v1.134.0

Wait.

Restarting cloud controller brings CPU and memory use back down to expected levels.

Our cloud_controller config is pretty vanilla, but let us know if any settings would be of interest for this case.

Others have mentioned in Slack that this seems to occur with the bump from v1.133.0 to v1.134.0, as referenced here: https://cloudfoundry.slack.com/archives/C07C04W4Q/p1660123878729779?thread_ts=1660062711.781649&cid=C07C04W4Q

Expected result

No increase in memory or CPU use over time, except as related to load.

Current result

CPU and memory use increase over time, even with no load.

Possible Fix

I don't have enough knowledge about cloud controller to speculate intelligently, but I will vaguely point my uneducated finger at the bump in the Ruby version.


@philippthun
Member

We are seeing similar behavior after updating to 1.132.0, i.e. the Ruby 3.0 update. The increase in memory used on the cc-worker VMs can be seen in the following graph:

[graph: memory usage of cc-worker VMs increasing over time]

This landscape receives ~35 API requests per second.

@metskem

metskem commented Aug 15, 2022

Same issue for us; rolling back and forward also shows that 1.134.0 introduces this CPU and memory increase.
[graph: CPU/memory usage across the rollback and roll-forward]

@sethboyles
Member

@metskem @pusherofbrooms are you seeing the usage increase on the API VMs or the workers, or both?

@philippthun are you seeing the usage increase on just the workers, or also on the API VMs?

@metskem

metskem commented Aug 16, 2022

We see it only on the API VMs.

@pusherofbrooms
Author

The API instances are where we see CPU and memory leakage, specifically in the cloud_controller_ng process.

@philippthun
Member

We (when running CAPI 1.132.0 with Ruby 3.0) are only seeing this on cc-worker VMs.

We did some local tests (now with Ruby 3.1) and profiled the worker process with rbspy (for 5 minutes). This is the summary:

% self  % total  name
 97.66    97.66  sleep [c function] - (unknown)
  0.76     1.76  block in <top (required)> - /Users/me/SAPDevelop/github_com/capi-release/src/cloud_controller_ng/config/routes.rb
  0.26     0.26  [] - <internal:dir>
  0.15     0.26  visit - /Users/me/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/actionpack-6.1.5.1/lib/action_dispatch/journey/visitors.rb
  0.10     0.10  exec [c function] - (unknown)
  0.09     0.20  _racc_do_parse_c [c function] - (unknown)
  0.08     0.36  new [c function] - (unknown)
  0.08     0.08  mtime [c function] - (unknown)
  0.05     0.26  each - /Users/me/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/actionpack-6.1.5.1/lib/action_dispatch/journey/nodes/node.rb
  0.05     0.08  find [c function] - (unknown)

As there are no jobs, the worker is sleeping most of the time. But besides that it's constantly reloading the ccng routes!? The worker does not need the routes at all, right?

I also used heap-profiler on the bigger environment (also for 5 minutes) and heapy to analyze the retained heap dump:

object count (35334)
==============================
  1844  /var/vcap/data/packages/cloud_controller_ng/cf73df8f7fa0f2398bbc6a0da417d329847f3470/gem_home/ruby/3.0.0/gems/activesupport-6.1.5.1/lib/active_support/dependencies.rb:332
  1477  /var/vcap/data/packages/cloud_controller_ng/cf73df8f7fa0f2398bbc6a0da417d329847f3470/gem_home/ruby/3.0.0/gems/actionpack-6.1.5.1/lib/action_dispatch/journey/parser.rb:138
   929  /var/vcap/data/packages/cloud_controller_ng/cf73df8f7fa0f2398bbc6a0da417d329847f3470/gem_home/ruby/3.0.0/gems/activesupport-6.1.5.1/lib/active_support/file_update_checker.rb:130
   925  <internal:dir>:44
   646  <internal:kernel>:49
   616  /var/vcap/data/packages/cloud_controller_ng/cf73df8f7fa0f2398bbc6a0da417d329847f3470/gem_home/ruby/3.0.0/gems/actionpack-6.1.5.1/lib/action_dispatch/journey/parser.rb:178
   504  /var/vcap/data/packages/cloud_controller_ng/cf73df8f7fa0f2398bbc6a0da417d329847f3470/gem_home/ruby/3.0.0/gems/actionpack-6.1.5.1/lib/action_dispatch/journey/visitors.rb:24
   504  /var/vcap/data/packages/cloud_controller_ng/cf73df8f7fa0f2398bbc6a0da417d329847f3470/gem_home/ruby/3.0.0/gems/actionpack-6.1.5.1/lib/action_dispatch/journey/visitors.rb:25
   502  /var/vcap/data/packages/cloud_controller_ng/cf73df8f7fa0f2398bbc6a0da417d329847f3470/gem_home/ruby/3.0.0/gems/actionpack-6.1.5.1/lib/action_dispatch/journey/visitors.rb:139
   462  /var/vcap/data/packages/cloud_controller_ng/cf73df8f7fa0f2398bbc6a0da417d329847f3470/gem_home/ruby/3.0.0/gems/actionpack-6.1.5.1/lib/action_dispatch/journey/visitors.rb:19

So the retained memory seems to be related to this (unnecessary) route reloading.
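
As a side note, for anyone who wants to reproduce this kind of analysis without the heap-profiler gem, here is a rough sketch that uses Ruby's built-in ObjectSpace to produce a dump that heapy can read; the workload and the dump path are placeholders:

require 'objspace'
require 'json'

ObjectSpace.trace_object_allocations_start
payload = Array.new(1_000) { { 'key' => 'value' * 100 } }  # stand-in workload; replace with the code path under test
JSON.dump(payload)
GC.start                                                   # collect garbage first so the dump shows retained objects
File.open('/tmp/heap.dump', 'w') do |f|
  ObjectSpace.dump_all(output: f)                          # analyze with: heapy read /tmp/heap.dump
end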

To stop this constant reloading, one can set RAILS_ENV="production", which is done for the API processes but not for the workers. This changes the rbspy summary to:

% self  % total  name
 99.86    99.86  sleep [c function] - (unknown)
  0.07     0.07  exec [c function] - (unknown)
  0.01     0.09  log_connection_yield - /Users/me/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/sequel-5.58.0/lib/sequel/database/logging.rb
  0.01     0.01  reload! - /Users/me/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/delayed_job-4.1.9/lib/delayed/worker.rb
  0.01     0.01  literal_append - /Users/me/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/sequel-5.58.0/lib/sequel/dataset/sql.rb
  0.01     0.01  yield_hash_rows [c function] - (unknown)
  0.01     0.01  caller [c function] - (unknown)
  0.00     0.12  synchronize - /Users/me/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/sequel-5.58.0/lib/sequel/database/connecting.rb
  0.00     0.06  first - /Users/me/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/sequel-5.58.0/lib/sequel/dataset/actions.rb
  0.00     0.00  strftime [c function] - (unknown)
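
For context, a minimal, hypothetical Rails 6.1 illustration of what the environment switch changes (cloud_controller_ng only embeds parts of Rails, so its actual wiring may differ): with RAILS_ENV unset, Rails falls back to "development", where code and routes are re-checked via ActiveSupport::FileUpdateChecker, whereas a production-style configuration loads everything once.

# Hypothetical illustration, not CCNG's actual configuration.
Rails.application.configure do
  config.cache_classes = true  # production default: do not reload application code
  config.eager_load    = true  # production default: load everything once at boot
end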

Although this could mitigate the problem, it still does not explain why memory has been increasing since Ruby 3 (route reloading was also happening before). And it is also not applicable to the API processes, where others are seeing increasing memory consumption.

@sethboyles
Member

We also noticed the workers reloading due to the RAILS_ENV not being set correctly. We should make the change, but we agree that it doesn't adequately explain the issue.

@moleske
Member

moleske commented Aug 16, 2022

We're still looking at this, but we've noticed a discrepancy between what heap-profiler and monit status report. @philippthun Have you seen this difference or been able to reconcile it?

johha added a commit that referenced this issue Aug 17, 2022
`RAILS_ENV` should be set to `production` to prevent Rails/Active Support from reloading routes. In the context of a cc worker this feature is not needed because there are no routes.

Related to #262
@johha
Contributor

johha commented Aug 17, 2022

Regarding the issue on the API nodes: do we know whether the issue is occurring in the cloud_controller_ng process or in the local_worker process?

johha added a commit that referenced this issue Aug 17, 2022
On local workers `RAILS_ENV` should be set to `production` to prevent Rails/Active Support from reloading routes. In the context of a cc worker this feature is not needed because there are no routes.

Same as #263 for dedicated workers.

Related to #262
@KevinJCross

Hi, we on the @cloudfoundry/cf-autoscaler team are seeing this happen very quickly in our acceptance test environment.
Within a couple of days we see it reach max heap, and a couple more days after that a large portion of normal cf calls start timing out after 30s. We currently have 4 API instances running to handle our acceptance tests (which seems like overkill to us), but we still hit this after a few days to a week.
We can grab a heap dump or similar if you can tell us what to do (I know very little about Ruby VMs).

@pusherofbrooms
Author

Greetings @johha,
Here is the process list sorted by memory use:

api/c6b71950-8b34-48d2-ba50-74e7573bbc37:~$ ps aux --sort -rss | head -10
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
vcap     12375 11.1 20.2 1017576 770528 ?      S<   Aug16 135:46 /var/vcap/packages/cloud_controller_ng/gem_home/ruby/3.1.0/bin/rake jobs:local[cc_api_worker.api.1.1]
vcap     12470 11.1 20.2 1017144 769484 ?      R<   Aug16 135:57 /var/vcap/packages/cloud_controller_ng/gem_home/ruby/3.1.0/bin/rake jobs:local[cc_api_worker.api.1.2]
vcap     12141 37.9 12.3 1582872 469516 ?      R<l  Aug16 463:12 ruby /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -c /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml
vcap     14342  0.4  0.6 1314788 23936 ?       S<l  Aug16   5:28 /var/vcap/packages/bosh-dns/bin/bosh-dns --config /var/vcap/jobs/bosh-dns/config/config.json
vcap     13870  1.5  0.5 1240720 21552 ?       S<l  Aug16  18:25 /var/vcap/packages/syslog-agent/syslog-agent
vcap     13661  1.2  0.5 1165220 21288 ?       S<l  Aug16  14:38 /var/vcap/packages/loggregator_agent/loggregator-agent
vcap     13764  4.1  0.5 1239792 20812 ?       S<l  Aug16  50:50 /var/vcap/packages/forwarder-agent/forwarder-agent
vcap     14180  2.7  0.5 1239192 20684 ?       S<l  Aug16  33:39 /var/vcap/packages/metrics-agent/metrics-agent
vcap     13002  0.0  0.5 1315936 19848 ?       S<l  Aug16   1:06 /var/vcap/packages/routing-api/bin/routing-api -config /var/vcap/jobs/routing-api/config/routing-api.yml -logLevel info -timeFormat rfc3339 -ip 10.10.82.0

So the workers seem to be the top memory users, but the cloud_controller_ng process seems to be the top CPU user.

@philippthun
Member

@moleske No, we did not look at monit status. Can you provide more details on what differences you are seeing?

@philippthun
Member

@pusherofbrooms Okay, so the memory issue might go away with the change to RAILS_ENV for the local workers (#264).

To understand what the ccng process is doing, could you run rbspy for a while and provide us with the results? Here is a short how-to guide:

bosh -d cf ssh api/0

wget -O rbspy.tar.gz "https://github.com/rbspy/rbspy/releases/download/v0.12.1/rbspy-x86_64-musl.tar.gz"
tar -xf rbspy.tar.gz
rm rbspy.tar.gz
mv rbspy-* rbspy
chmod +x rbspy

sudo ./rbspy record --pid $(pgrep -f "ruby /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller") --silent --duration 60 --file summary.txt --format summary --raw-file raw.gz

Maybe you can also change the duration to e.g. 300 (seconds) to get some more data.

@sethboyles
Member

sethboyles commented Aug 17, 2022

@philippthun what we were observing (on the API process) was that heap-profiler would report around 225MB of objects in its memory, but monit status would report that cloud_controller_ng was using ~2GB of memory. We weren't clear on what would account for the discrepancy.
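
One rough way to see this gap from inside the process is to compare Ruby-level heap accounting with the OS-level resident set size; the difference (heap fragmentation, memory malloc'd by C extensions such as JSON encoders, pages freed by Ruby but not returned to the OS) would not show up in a Ruby heap dump. A sketch, assuming a Linux /proc filesystem:

require 'objspace'

ruby_heap_bytes = ObjectSpace.memsize_of_all                  # bytes tracked for Ruby objects
live_slots      = GC.stat(:heap_live_slots)                   # live object slots on the Ruby heap
rss_kb          = File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+)/, 1].to_i

puts "ruby heap:   #{ruby_heap_bytes / (1024 * 1024)} MB (#{live_slots} live slots)"
puts "process RSS: #{rss_kb / 1024} MB"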

@philippthun
Member

@sethboyles We didn't really look at the absolute values reported by heap-profiler, only at the ordered lists, to find suspicious objects.

We've set the correct RAILS_ENV on one of the cc-worker VMs and the results look promising (the "yellow" worker is the patched one):
[graph: memory usage of the cc-worker VMs, with the patched worker in yellow]

@sethboyles
Member

@philippthun we find it curious that you all are not experiencing issues with the API process like others are. Is that correct?

Do you think it might be related to how you all have use_optimized_json_encoder enabled?

cc.experimental.use_optimized_json_encoder:
description: "Switch to a different Ruby JSON encoder, i.e. Oj (https://github.com/ohler55/oj)."
default: false

We are thinking that maybe there is an issue with yajl, which is what CCNG uses by default, and in our heap profiling we did see yajl taking up a lot of memory. We also found a few issues related to yajl's performance: https://bugs.ruby-lang.org/issues/18511 and brianmario/yajl-ruby#221. There's nothing definitive here yet linking yajl to the specific issues we are seeing, but since that is something that may be different on your foundations, it might be worthwhile to look into.
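
For reference, a minimal, hypothetical sketch of switching between the yajl and Oj backends at the multi_json level; how CCNG actually wires the use_optimized_json_encoder flag in config/initializers/json.rb may differ:

require 'multi_json'

# Default path: the yajl-ruby backend (requires the yajl-ruby gem).
MultiJson.use(:yajl)
MultiJson.dump('rules' => Array.new(100) { { 'protocol' => 'tcp', 'ports' => '443' } })

# With the "optimized" encoder: the Oj backend (requires the oj gem).
MultiJson.use(:oj)
MultiJson.dump('rules' => Array.new(100) { { 'protocol' => 'tcp', 'ports' => '443' } })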

@philippthun
Member

@sethboyles

  • We are also seeing some increase in memory usage on the API VMs; this is (mostly?) caused by the local worker processes. In general it's less obvious than the increase on the worker VMs (each running 8 worker processes, btw); that's why we haven't looked into it in detail so far.
  • We definitely do not see this constant CPU usage increase that others are experiencing.
  • We are still on Ruby 3.0, not 3.1.
  • Yes, we are using the "optimized json encoder" on all our foundations.

[two graphs omitted]

Side note: with the "RAILS_ENV fix", the CPU usage on the cc-workers seems to drop significantly.

@sethboyles
Member

sethboyles commented Aug 17, 2022

We believe we may have found the source of a memory leak, although it is one we knew about before we upgraded to Ruby 3.0.

By creating a large number of ASGs (~10,000), each with 100 rules, we were able to trigger large memory leaks.
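
A hypothetical sketch of the kind of payload this produces, just to make the scale concrete (the rule shape loosely follows CF security-group rules; all values here are made up):

rules = Array.new(100) do |i|
  { 'protocol' => 'tcp', 'destination' => "10.0.#{i}.0/24", 'ports' => '443' }
end
asg_payloads = Array.new(10_000) { |i| { 'name' => "stress-asg-#{i}", 'rules' => rules } }

Encoding API responses built from payloads like this is what exercises the JSON encoder discussed below.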

Using heapy, we were able to see several generations, each with only 1 object, yet large memory usage:

Analyzing Heap
==============
Generation: nil object count: 280536, mem: 0.0 kb
Generation:  45 object count: 8240, mem: 1793.1 kb
Generation:  46 object count: 5160, mem: 1132.0 kb
Generation:  47 object count: 6744, mem: 1091.8 kb
Generation:  48 object count: 6670, mem: 915.9 kb
Generation:  49 object count: 8020, mem: 1528.2 kb
Generation:  50 object count: 20581, mem: 3976.4 kb
Generation:  51 object count: 33500, mem: 7375.5 kb
Generation:  52 object count: 21390, mem: 5777.1 kb
Generation:  53 object count: 63445, mem: 9622.6 kb
Generation:  54 object count: 8561, mem: 1595.0 kb
Generation:  55 object count: 294, mem: 17.5 kb
Generation:  58 object count: 6, mem: 2.1 kb
Generation:  61 object count: 6, mem: 2.1 kb
Generation:  62 object count: 6, mem: 2.1 kb
Generation:  63 object count: 2, mem: 0.7 kb
Generation:  64 object count: 4, mem: 1.7 kb
Generation:  66 object count: 6, mem: 2.0 kb
Generation:  67 object count: 28, mem: 4.1 kb
Generation:  68 object count: 6, mem: 2.0 kb
Generation:  70 object count: 1, mem: 64561.5 kb
Generation:  71 object count: 421, mem: 38.0 kb
Generation:  72 object count: 5, mem: 0.6 kb
Generation:  73 object count: 12, mem: 4.1 kb
Generation:  75 object count: 12, mem: 4.1 kb
Generation:  76 object count: 2, mem: 0.5 kb
Generation:  77 object count: 10, mem: 4.9 kb
Generation:  79 object count: 1, mem: 64561.5 kb
Generation:  80 object count: 852, mem: 146.1 kb
Generation:  81 object count: 41, mem: 5.1 kb
Generation:  82 object count: 39, mem: 7.3 kb

Examining one of the generations, we saw that it was indeed our large payload from the ASG endpoint:

allocated by memory (61079397) (in bytes)
==============================
  61079397  /var/vcap/data/packages/cloud_controller_ng/46a8cc86e43a32e41d5b5e91e6983e75542ed547/gem_home/ruby/3.1.0/gems/multi_json-1.15.0/lib/multi_json/adapters/oj.rb:56

object count (1)
==============================
  1  /var/vcap/data/packages/cloud_controller_ng/46a8cc86e43a32e41d5b5e91e6983e75542ed547/gem_home/ruby/3.1.0/gems/multi_json-1.15.0/lib/multi_json/adapters/oj.rb:56

(it also spit out the entire payload)

We conducted this test with the use_optimized_json_encoder flag enabled. However, the general memory pattern was reproducible when not using the flag.

We were skeptical that both Oj and Yajl had memory leaks, so we decided to replace the JSON encoder's calls to MultiJson.dump with direct calls to Oj.dump:

https://github.com/cloudfoundry/cloud_controller_ng/blob/2ab201a7c53716cff7c23c35f1cb33ed34f8c517/config/initializers/json.rb#L27-L31

After making the swap, we no longer saw the generations stuck with a single huge object.

This still needs some testing, but might be promising.
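
A rough sketch of the kind of swap described above; the module and method names here are illustrative only, and the real change lives in cloud_controller_ng's config/initializers/json.rb (linked above):

require 'oj'

module JsonSketch
  # Before: MultiJson.dump(object) went through the multi_json adapter layer.
  # After: call Oj directly; :compat mode mimics the stdlib JSON output format.
  def self.dump(object)
    Oj.dump(object, mode: :compat)
  end

  def self.load(json)
    Oj.load(json, mode: :compat)
  end
end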

@moleske
Member

moleske commented Aug 18, 2022

We created a draft PR to experiment with removing MultiJson.dump in json.rb. There are still some issues to resolve (like MultiJson being called in lots of places), but we wanted to share early to get feedback and so that others can test it out and see whether it resolves any issues for them.

@KevinJCross

KevinJCross commented Aug 24, 2022

Hey,
We now have a couple of changes that help with this problem and will probably be released.
Can we make a patch release to improve the situation for those who are on the affected release?

@sethboyles
Member

We've cut a release containing the changes to address these memory/CPU issues: https://github.com/cloudfoundry/capi-release/releases/tag/1.136.0

@pusherofbrooms
Author

pusherofbrooms commented Sep 15, 2022

FYI, with the latest cf-deployment (v21.9.0), which includes the capi-release with the fix, this resource leak seems to be resolved. CPU and memory use no longer appear to be increasing overall.

@sethboyles
Member

Great to hear--thanks for the feedback!

@sethboyles
Member

Closing. Please let us know if you observe any more issues!
