-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataplane part of forwarder-vpp
leaks
#1120
Comments
Provided by @szvincze |
|
Current stateIt looks like Next steps
Total: 35h |
/cc @szvincze |
Current status
Controlplane problems:
Dataplane problems:
|
Hi, @NikitaSkrynnik :Could you please help our investigation with the Dataplane Prblems you mentioned above and point to VPP? I loaded VPP (v23.10-rc0 local build) API with vxlan tunnel creates, tap creates, ACL xConnect create in batch for ~24 hours and was not able to reproduce such issue. I tried to use the same go API calls that are used by the cmd-forwarder-vpp. |
Hi, @elajkat! Unfortunately we didn't have enough time to reproduce the problem on VPP. You can try to reproduce it using NSM and check VPP which runs inside a forwader. Here are the steps:
After repeated scaling up/down I can observe several |
Hi, with the suggested scale up - scale down process I got some hanging interfaces, but those are in up state, i.e.:
Next week I check this in detail with VPP API trace and forwarder logs etc. and try reproduce this without forwarder with pure VPP and API calls. |
You may find the vpp api trace helpful for capturing the sequence of API calls that leads to the issue for handoff to VPP folks for reproduction. |
I also played around with a slightly modified reproduction. I was able to reliably reproduce the leftover tun/vxlan_tunnel interfaces. At this time I only focused on identifying the suggested vpp bug. However I found no indication in vpp's api trace, that we have a vpp bug here. Here's what I did and saw. The environment:
The reproduction loop:
In the above repro steps it slightly bothered me (due to my weak kubernetes-fu) that we do not wait for the completion of a scale operation (other than sleeping for a hardcoded period) before we start a new scale operation. Therefore I replaced the sleeps with kubectl wait commands (and a final sleep just to make sure) as you can see below:
(note: Please correct me if you think these kubectl wait commands to be incorrect.) With this method I could reliably produce leftover interfaces. Usually already in the 1st or 2nd iteration of the reproduction loop we have some. Always the same number of taps as vxlan_tunnels. Most of the time with the same suffix (like tap3 and vxlan_tunnel3), but sometimes with different suffixes (like tap0 and vxlan_tunnel2). These interfaces are always up. Out of 20 runs leaving interfaces around, I have seen 20 in which the vpp api trace contains no indication that anybody ever tried to delete the leftover interfaces: For tap interfaces I could grep for the tap_delete message with a concrete sw_if_index, e.g.:
For vxlan_tunnel interfaces unfortunately the vpp api trace contains only identifiers that are harder to pair with sw_if_index, therefore I only counted the number of vxlan_tunnel creates and deletes - looking for having fewer deletes than creates:
The difference between creates and deletes were the number of vxlan_tunnel interfaces left around in all 20 test runs. In summary I could not confirm the presence of a vpp bug - or at least my reproduction method does not reproduce it. By the way the fact that the leftover interfaces are always present in tap+vxlan_tunnel pairs also makes me think its cause is more likely to be in nsm than vpp. If anybody is interested I'm happy to share the scipts I used. Let me know if you have other ideas to refine the reproduction steps. Let me know what you think! |
Hello @rubasov, thank you for testing it. We're planning to continue working on this issue soon. As I recall, the main issue is revealed only to leaked vpp infaces with state up. At this moment, it could be helpful if you shared the scripts, and then we could verify steps to reproduce. UPD: meant state: DOWN |
Modify the kernel2ethernet2kernel case: * Remove the antiaffinity, so we'll have k2e2k mixed with k2k. * Wrap the alpine pod in a deployment, so we can scale it. * Extend NSM_CIDR_PREFIX, so we have enough IPs to scale. Signed-off-by: Bence Romsics <[email protected]>
Hi @denis-tingaikin, I made some modifications and now I believe I have a better reproduction - hopefully reproducing every bug symptom Nikita described. I'm attaching the scripts I used. My earlier approach (as in my previous comment) always stopped after the first leftover interface was detected (no matter if it was up or down). However these interfaces were always up and after a few minutes they got deleted, therefore it's not likely they could contribute to a long term memory leak. The current approach does not stop at the first leftover interface, but continues scaling the nsc deployment up and down. Usually after many more iterations (up to a 100, sometimes 150) this approach produces leftover interfaces both up and down (many up, a few down), out of which a few remains indefinitely (the down interfaces always remain). With this complex reproduction I'm far from having a proper understanding of what's happening, however the following factors seem to be relevant:
Regarding the scripts. Please read them before running. First run:
setup-1.sh creates a kind cluster. I used it without having other kind clusters around. Later runs:
|
forwarder-vpp
leaks
Hi @rubasov, @elajkat. I collected the traces for the issue with Here are the scripts I use: iface-leaks.zip. Just run |
Hi @denis-tingaikin, @NikitaSkrynnik, While @ljkiraly is out of office we're trying to track the status of this interface leak. We have seen networkservicemesh/sdk#1650 merged. The pull request description says it should fix all interface leaks. However Nikita's comment in networkservicemesh/sdk#1649 reads to me as if the merged patch was only part of a planned fix. We also built a new forwarder with the sdk patch merged and tested it with the reproduction method we discussed earlier. Unfortunately we still see leftover interfaces. Can you please help us understand what exactly the merged patch fixes? Did we maybe miss another patch we should have used for a full fix? |
Hi, @rubasov. We found out that we need two more patches to fix all interface leaks: |
@rubasov Many thanks for testing it. Currently, we also looking for other solutions, it would be perfect if you could add these two patches #1120 (comment) (note: requires SDK with commit 301631365421a9d916a5ab7224fa9fbd847320b2) and test it. |
Hi @NikitaSkrynnik, @denis-tingaikin, I did some more testing and wanted to get back to you with the results: I started with sdk commit 78bedfb4 (which already includes commit 30163136 "Add a timeout for Closes in begin.Server (#1650)"). I had to resolve one merge conflict between https://github.com/networkservicemesh/sdk/pull/1656/files#diff-963d51540e41ec0a05680ad81a593e4195fb8092eceb003e3f53778ec502fb74R45 and https://github.com/networkservicemesh/sdk/pull/1655/files#diff-963d51540e41ec0a05680ad81a593e4195fb8092eceb003e3f53778ec502fb74R46. Then also noticed that the s/closeTimeout/contextTimeout/ rename had to be applied here too: Using this sdk I built a new forwarder and ran the usual reproduction steps. Unfortunately I still saw leftover interfaces as before. However seeing these conflicts between PR 1656 and 1655 I'm wondering if I'm really testing with the same code as you are. It feels like I'm still missing something. Let me know what you think. |
@rubasov Could you try this forwarder image: |
Hi @NikitaSkrynnik, Thanks for the image! I started the repro script (which is still basically the same as I posted earlier) using your forwarder image. It seems to be working. Earlier, without the fixes, I believe the maximum iteration count I ever saw needed for an interface left in down was between 160-170. With the fixes included my repro is at iteration 670. And still no interface left in down. So I believe this at least significantly reduced the frequency or hopefully fully fixed it. I'm interested in how this image was built differently from my last attempt. If you have the time, please share! Thanks, |
Hi @NikitaSkrynnik, I wanted to get back after we have done some longer test runs taking the main branch after all your fixes merged. We still occasionally see some leftover interfaces. But it usually takes 500-1000 nsc scale-up-down iterations (with the usual repro scripts) until the error happens once. I believe this is 1-1.5 magnitudes better than without the fixes. At this moment we hope this is good enough even for long-lived deployments. Of course if we learn anything new, we'll get back to you. Thank you for your help! |
Statistics from rc2.
|
We plan to revert the begin chain element networkservicemesh/sdk#1668 when Netlink is released: vishvananda/netlink#1019 |
Several weeks back while working with the interface leak problem I also wanted to understand the link between the interface leak and the memory leak. Yesterday talking to Laszlo and Szilard we realized I never shared the results of this. This has practically nothing to do with nsm itself, however it still could be useful in understanding how the interface leak casued the memory leak, so I summarize it here. The first step was to realize that the main problem were the leftover tap interfaces. Knowing this I no longer needed the complicated and long running reproduction scripts we used in other parts of the analysis. Instead I could just create tap interfaces in bulk and observe the memory usage. I created a test vm (under libvirt/kvm) and installed vpp into it.
We can see that both the vpp self-reported memory usage and the kernel's report about the vpp process itself shows cc. 30 KiB / tap interface memory usage - which is negligibly low. At this rate we are never running out of today's typical container memory limits (usually hundreds of MiBs of free memory). However the overall used memory in this vm grows much faster. To show the typical values and the variance here are the results of 5 runs: 6697 KiB/tap This is more than 200x times what the vpp process itself uses per tap interface. Where does this memory go? I believe this discrepancy was also observed in a k8s environment between the cgroups reported overall memory usage and the vpp process's memory usage. In this simpler environment it's already clear that the extra memory must be used by the linux kernel. Let's find some positive proof of that! We'll look into /proc/slabinfo where the kernel tracks several kinds of internal allocation statistics (https://man7.org/linux/man-pages/man5/slabinfo.5.html).
That means we already found 90% of the to-be-explained memory usage in the allocation statistics of the kmalloc-8k slabs. At this point I was convinced I had positive proof that the extra memory is used by the kernel and did not look into the changes of other slab types (like kmallocs of different units). But why does a tap interface use this much memory? Probably because vpp creates multiqueue tap interfaces (I believe for performance reasons on multicore cpus) and all the buffers allocated are multiplied by the number of queues. The relevant code is here: I shortly looked into creating the same tap interface setup by the ip and tc tools. I started off like this:
However now I believe that not all tunables can be set via these tools, so I could not create the exact same thing this way. If somebody is interested, the tap creation code could be extracted from vpp. Please also note that this memory could appear as if used by vpp - because the tap interfaces are nonpersistent tap interfaces. Which means that if the vpp process dies then the linux kernel automatically destroys all tap interfaces requested by vpp. That's why I did not need to delete the taps in the above examples, I could just kill vpp instead. |
We can use nitems parameter in
|
Just wanted to comment it here that we started the same measurements with limited nitems, so big +1 for it. |
In my local env I started a test with vpp 24.10, nitems=1000 (to be small enough to reach the saturation quick enough), my env is in kind, 2 forwarders. 2024-11-26 14:28:04.439240278+01:00 ..... 2024-11-26 16:21:11.986159040+01:00 Sorry I have no time to parse the raw outputs to something easier to consume, or to a chart. |
I used value 4096 |
I believe that setting the default value of nitems to some fair, low number is perfect, the admins/operators can set anytime the value in vpp cli: The value must be documented with some help for the admins how to change it. |
Hi, @szvincze! We don't see any memory leaks on our side. Could you tell about your test results? Is everything stable on your side? Can we close this issue? |
Hi @NikitaSkrynnik, we will have a weekend test that covers this issue also (with decreased nitems value for VPP), after that we will have results from our labs. Thanks |
forwarder-vpp_goroutineprofiles_20240527074108.tar.gz
forwarder-vpp_memprofiles_20240527074153.tar.gz
TODO
The text was updated successfully, but these errors were encountered: