Major lag in SCT logging handling #7069
Comments
Not a new issue, but it has become quite severe recently. I know that @fruch wants proof of the root cause, but I think we both know exactly where the bottleneck is. We need to implement "client side filtering" and possibly improve or remove some of the regexes used for creating events.
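To illustrate the regex part of this idea, here is a minimal sketch, not SCT's actual code. It assumes the expensive step is running every event regex against every log line, and it puts a cheap substring check in front of the regexes so most lines are dropped early. The pattern names, keywords, and function names are all hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical event patterns; the real ones would come from the test framework.
EVENT_PATTERNS = {
    "REACTOR_STALL": re.compile(r"Reactor stalled for \d+ ms"),
    "BACKTRACE": re.compile(r"backtrace", re.IGNORECASE),
}

# Cheap lowercase literals; a line containing none of them cannot match any pattern above.
KEYWORDS = ("reactor stalled", "backtrace")


def prefilter(lines: Iterable[str]) -> Iterator[str]:
    """Yield only lines that contain at least one keyword (no regex involved)."""
    for line in lines:
        lowered = line.lower()
        if any(keyword in lowered for keyword in KEYWORDS):
            yield line


def match_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Run the regexes only on lines that survived the prefilter."""
    for line in prefilter(lines):
        for name, pattern in EVENT_PATTERNS.items():
            if pattern.search(line):
                yield name, line
                break  # first matching pattern wins in this sketch
```

The same `prefilter` step could in principle run on the side that produces the logs, so that uninteresting lines never reach the runner at all; whether that helps depends on where the CPU time actually goes.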
I don't think that would be enough. We need a way to get metrics out of the runner node as well, to help us figure out where and when we have those bottlenecks. Client filtering ain't gonna solve it all.
Maybe, but I'm sure this is where we should start.
I've started trying out the client side filtering:
One more thing we can evaluate is transferring all logs to Loki and using its APIs to filter/tail the things we are looking for.
Now that we have the runner metrics: it's not something new, but we can now try bigger instances as a temporary solution, test the client filtering options, and see how much it can lower this (if it's the root cause, and not just a side effect).
The major cause of the lag seems to be long log lines.
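As an illustration of one way to limit the cost of long lines, the sketch below caps each line at a fixed length before it is handed to any regex matching. This is an assumption about a possible mitigation, not a description of what any PR in this thread does; `MAX_LINE_LENGTH` and `cap_line` are hypothetical names, and the 4096 limit is arbitrary.

```python
MAX_LINE_LENGTH = 4096  # assumed cap; the right value would need measuring


def cap_line(line: str, limit: int = MAX_LINE_LENGTH) -> str:
    """Return the line unchanged if short enough, otherwise a truncated copy.

    The suffix records how many characters were dropped, so the truncation
    stays visible to whoever reads the log later.
    """
    if len(line) <= limit:
        return line
    dropped = len(line) - limit
    return f"{line[:limit]} ...[truncated {dropped} chars]"
```

Truncation trades completeness for bounded matching time: a pattern that would only match past the cap is missed, so the cap has to be larger than the prefix any event regex relies on.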
One more PR to help pinpoint log-related issues:
Still considering whether client side filtering is helpful; the stats show it has less impact on runner CPU.
#7128 solves the major part of this issue.
Issue description
According to a discussion with @fruch,
see timestamps in:
There is a major lag in our logging handling, and the failure is about finding something in the logs,
so it's the lagged logging that is causing the issue.
Installation details
Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash):
5.5.0~dev-20231227.331d9ce788e2
with build-id 5a3ba5068a1b94097fb0f3fab64cdb912cff2911
Cluster size: 5 nodes (i4i.8xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-0417c7525e0d98293
(aws: undefined_region)
Test:
longevity-mv-si-4days-test
Test id:
eabd1c6e-ab96-4d12-9881-41c2babffc64
Test name:
scylla-master/longevity/longevity-mv-si-4days-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor eabd1c6e-ab96-4d12-9881-41c2babffc64
$ hydra investigate show-logs eabd1c6e-ab96-4d12-9881-41c2babffc64
Logs:
Jenkins job URL
Argus