Major lag in SCT logging handling #7069

Closed
temichus opened this issue Jan 5, 2024 · 9 comments
@temichus (Contributor) commented Jan 5, 2024

Issue description

Following a discussion with @fruch.

See the timestamps in:

2023-12-30 13:56:55.714 <2023-12-30 13:12:48.079>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=39321501-78b3-44a2-bb21-8090ef2adee6: type=NO_SPACE_ERROR regex=No space left on device line_number=177057 node=longevity-mv-si-4d-master-db-node-eabd1c6e-2
2023-12-30T13:12:48.079+00:00 longevity-mv-si-4d-master-db-node-eabd1c6e-2      !ERR | scylla[29039]:  [shard  7:  mt] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)

That is a major lag in our logging handling (roughly 44 minutes between the original Scylla log line and the SCT event), and the failure itself is about finding something in the logs, so the lagged logging is what causes the issue.
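
For reference, the gap between the two timestamps quoted above can be computed directly (a quick sketch, assuming both timestamps are UTC):

```python
# Lag between the SCT event timestamp and the Scylla log line it refers to
# (both values copied from the lines above).
from datetime import datetime, timezone

event_ts = datetime(2023, 12, 30, 13, 56, 55, 714000, tzinfo=timezone.utc)  # SCT event
log_ts = datetime(2023, 12, 30, 13, 12, 48, 79000, tzinfo=timezone.utc)     # Scylla log line

print(event_ts - log_ts)  # 0:44:07.635000 -- roughly a 44-minute lag
```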

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 5.5.0~dev-20231227.331d9ce788e2 with build-id 5a3ba5068a1b94097fb0f3fab64cdb912cff2911

Cluster size: 5 nodes (i4i.8xlarge)

Scylla Nodes used in this run:

  • longevity-mv-si-4d-master-db-node-eabd1c6e-9 (34.242.82.15 | 10.4.9.172) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-8 (18.201.123.249 | 10.4.11.11) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-7 (54.78.196.225 | 10.4.11.226) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-6 (34.244.190.233 | 10.4.10.202) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-5 (34.240.194.88 | 10.4.9.118) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-4 (18.201.52.162 | 10.4.8.189) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-3 (54.194.161.65 | 10.4.10.99) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-2 (3.248.206.144 | 10.4.9.136) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-12 (34.244.230.11 | 10.4.8.137) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-11 (34.253.194.121 | 10.4.8.24) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-10 (54.171.73.99 | 10.4.10.112) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-1 (18.201.155.89 | 10.4.11.109) (shards: 30)

OS / Image: ami-0417c7525e0d98293 (aws: undefined_region)

Test: longevity-mv-si-4days-test
Test id: eabd1c6e-ab96-4d12-9881-41c2babffc64
Test name: scylla-master/longevity/longevity-mv-si-4days-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor eabd1c6e-ab96-4d12-9881-41c2babffc64
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs eabd1c6e-ab96-4d12-9881-41c2babffc64

Logs:

Jenkins job URL
Argus

temichus removed their assignment Jan 5, 2024
@roydahan (Contributor) commented Jan 8, 2024

This is not a new issue, but it has become quite severe recently.

I know that @fruch wants proof of the root cause, but I think we both know exactly where the bottleneck is.

We need to implement "client-side filtering" and possibly improve or remove some of the regexes used for creating events.
We should also open issues in Scylla to reduce the severity of some of the current ERRORs so we won't process them as events.
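
A minimal sketch of the kind of client-side filtering being proposed, assuming lines are pre-filtered on the DB node before they ever reach the runner's event regexes; the severity markers and function name are hypothetical, not SCT's actual configuration:

```python
# Hypothetical client-side filter: drop low-severity journal lines on the DB
# node so the runner only regex-scans lines that can actually become events.
import re

# Severity tokens assumed to look like the "!ERR" marker in the log line
# quoted in the issue description; the list below is illustrative only.
DROP_SEVERITIES = re.compile(r"!(?:DEBUG|TRACE|INFO)\b")

def client_side_filter(raw_lines):
    """Yield only the lines worth shipping to the runner."""
    for line in raw_lines:
        if DROP_SEVERITIES.search(line):
            continue  # dropped on the client; the runner never sees it
        yield line
```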

@fruch (Contributor) commented Jan 8, 2024

> This is not a new issue, but it has become quite severe recently.
>
> I know that @fruch wants proof of the root cause, but I think we both know exactly where the bottleneck is.
>
> We need to implement "client-side filtering" and possibly improve or remove some of the regexes used for creating events. We should also open issues in Scylla to reduce the severity of some of the current ERRORs so we won't process them as events.

I don't think that would be enough.

We also need a way to get metrics out of the runner node, to help us figure out where and when we hit those bottlenecks; client-side filtering alone isn't going to solve it all.
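
A minimal sketch of what pulling metrics out of the runner could look like, assuming psutil is available there; the logger name and sampling period are arbitrary choices for illustration:

```python
# Hypothetical runner-side metrics sampler: log CPU, memory and load so lag
# investigations have something to correlate log-processing delays against.
import logging
import time

import psutil

LOGGER = logging.getLogger("runner-metrics")

def sample_runner_metrics(period_sec: float = 30.0):
    while True:
        cpu = psutil.cpu_percent(interval=1)       # % CPU over a 1s window
        mem = psutil.virtual_memory().percent      # % of RAM in use
        load1, _, _ = psutil.getloadavg()          # 1-minute load average
        LOGGER.info("runner cpu=%.1f%% mem=%.1f%% load1=%.2f", cpu, mem, load1)
        time.sleep(period_sec)
```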

@roydahan (Contributor) commented Jan 8, 2024

Maybe, but I'm sure this is where we should start.
I don't need the metrics to know it's related to that.

@fruch (Contributor) commented Jan 18, 2024

I've started trying out the client-side filtering:

  • feature(syslogng): move all default filters to client side #7041

  • we still need to review the whole list of DB events we have; some of them might be able to move into the client-side filter

  • I think we should consider not generating events for warnings at all; we have quite a lot of them in some cases, and we don't really use or need them at run-time. Any warning-related handling of DB logs should be post-processing. We introduced it only to make sure we are not generating errors out of warning messages (regardless of their text).

  • reading multi-line backtraces requires the quite wide regexes we are running; if all backtraces were one-liners, we could identify them much quicker, since we wouldn't have to run those regexes on every single line of the log (see the sketch after this list)
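
A minimal sketch of the backtrace point above: fold continuation lines into the preceding message so the wide backtrace regexes run once per logical message instead of once per physical line. The continuation pattern here is an assumption for illustration, not Scylla's actual backtrace format:

```python
# Hypothetical backtrace folding: glue continuation lines (addresses, frames)
# onto the message that started them, yielding one logical line per message.
import re

# Assumed shape of backtrace continuation lines -- adjust to the real format.
CONTINUATION = re.compile(r"^\s+(?:0x[0-9a-f]+|at |/opt/|/lib/)")

def fold_backtraces(lines):
    buffered = None
    for line in lines:
        stripped = line.rstrip("\n")
        if buffered is not None and CONTINUATION.match(stripped):
            buffered += " | " + stripped.strip()  # append the frame to the message
            continue
        if buffered is not None:
            yield buffered
        buffered = stripped
    if buffered is not None:
        yield buffered
```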

One more thing we can evaluate is transferring all logs to Loki and using its APIs to filter/tail the things we are looking for.
That would offload all of the work to the monitor instance. I'll open a task with the details (and some possible steps); it doesn't contradict client-side filtering, nor getting rid of old event filters we don't need anymore.
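
A minimal sketch of the Loki idea, using Loki's query_range HTTP API; the "node" label and the "!ERR" line filter are assumptions about how the logs would be ingested, not an existing SCT integration:

```python
# Hypothetical Loki query: let the monitor instance do the filtering and only
# pull back error lines, instead of regex-scanning every line on the runner.
import requests

def fetch_error_lines(loki_url: str, node: str, start_ns: int, end_ns: int):
    """Return error-level log lines for one node via Loki's query_range API."""
    params = {
        # LogQL: stream selector on an assumed "node" label + a line filter.
        "query": f'{{node="{node}"}} |= "!ERR"',
        "start": start_ns,
        "end": end_ns,
        "limit": 1000,
        "direction": "forward",
    }
    resp = requests.get(f"{loki_url}/loki/api/v1/query_range", params=params, timeout=30)
    resp.raise_for_status()
    streams = resp.json()["data"]["result"]
    return [line for stream in streams for _ts, line in stream["values"]]
```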

@fruch (Contributor) commented Jan 21, 2024

Now that we have the runner metrics, we can clearly see the runner is at 100% CPU for long stretches, which is causing this lag.

[screenshot: runner metrics showing sustained 100% CPU]

This is not something new, but we can now try bigger instances as a temporary workaround, test the client-side filtering options, and see how much that lowers the load (if it's the root cause, and not just a side effect).

@fruch (Contributor) commented Jan 30, 2024

The major cause of the lag seems to be coming from long log lines, and it was addressed in:
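
(Not the fix referenced above, just an illustration of the long-line problem: a minimal sketch of capping line length before the event regexes run, with an arbitrary 8 KiB cap.)

```python
# Hypothetical guard: truncate pathologically long log lines before they hit
# the event regexes, so a single huge line cannot stall the processing loop.
MAX_LINE_LEN = 8 * 1024  # arbitrary cap for illustration

def cap_line(line: str) -> str:
    if len(line) <= MAX_LINE_LEN:
        return line
    # Keep the head of the line; the timestamp, severity and message live there.
    return line[:MAX_LINE_LEN] + " ...[truncated]"
```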

@fruch (Contributor) commented Jan 30, 2024

One more PR to help pinpoint log-related issues:

@fruch (Contributor) commented Jan 30, 2024

Still considering whether client-side filtering is helpful; the stats show it as having less impact on runner CPU.

@fruch (Contributor) commented Feb 12, 2024

#7128 solves the major part of this issue.
Closing for now, until we hit it again.

fruch closed this as completed Feb 12, 2024