Major lag in SCT logging handling #7069

Closed
temichus opened this issue Jan 5, 2024 · 9 comments
@temichus (Contributor) commented Jan 5, 2024

Issue description

Following a discussion with @fruch.

See the timestamps in:

2023-12-30 13:56:55.714 <2023-12-30 13:12:48.079>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=39321501-78b3-44a2-bb21-8090ef2adee6: type=NO_SPACE_ERROR regex=No space left on device line_number=177057 node=longevity-mv-si-4d-master-db-node-eabd1c6e-2
2023-12-30T13:12:48.079+00:00 longevity-mv-si-4d-master-db-node-eabd1c6e-2      !ERR | scylla[29039]:  [shard  7:  mt] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)

That is a major lag in our logging handling (roughly 44 minutes between the original Scylla log line and the SCT event), and the failure itself is about finding something in the logs, so the lagged logging is what causes the issue.
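
For reference, the gap between the two timestamps quoted above can be computed directly (a quick sketch, assuming both timestamps are UTC):

```python
# Lag between the SCT event timestamp and the Scylla log line it refers to
# (both values copied from the lines above).
from datetime import datetime, timezone

event_ts = datetime(2023, 12, 30, 13, 56, 55, 714000, tzinfo=timezone.utc)  # SCT event
log_ts = datetime(2023, 12, 30, 13, 12, 48, 79000, tzinfo=timezone.utc)     # Scylla log line

print(event_ts - log_ts)  # 0:44:07.635000 -- roughly a 44-minute lag
```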

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 5.5.0~dev-20231227.331d9ce788e2 with build-id 5a3ba5068a1b94097fb0f3fab64cdb912cff2911

Cluster size: 5 nodes (i4i.8xlarge)

Scylla Nodes used in this run:

  • longevity-mv-si-4d-master-db-node-eabd1c6e-9 (34.242.82.15 | 10.4.9.172) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-8 (18.201.123.249 | 10.4.11.11) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-7 (54.78.196.225 | 10.4.11.226) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-6 (34.244.190.233 | 10.4.10.202) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-5 (34.240.194.88 | 10.4.9.118) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-4 (18.201.52.162 | 10.4.8.189) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-3 (54.194.161.65 | 10.4.10.99) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-2 (3.248.206.144 | 10.4.9.136) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-12 (34.244.230.11 | 10.4.8.137) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-11 (34.253.194.121 | 10.4.8.24) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-10 (54.171.73.99 | 10.4.10.112) (shards: 30)
  • longevity-mv-si-4d-master-db-node-eabd1c6e-1 (18.201.155.89 | 10.4.11.109) (shards: 30)

OS / Image: ami-0417c7525e0d98293 (aws: undefined_region)

Test: longevity-mv-si-4days-test
Test id: eabd1c6e-ab96-4d12-9881-41c2babffc64
Test name: scylla-master/longevity/longevity-mv-si-4days-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor eabd1c6e-ab96-4d12-9881-41c2babffc64
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs eabd1c6e-ab96-4d12-9881-41c2babffc64

Logs:

Jenkins job URL
Argus

temichus removed their assignment Jan 5, 2024
@roydahan (Contributor) commented Jan 8, 2024

This is not a new issue, but it has become quite severe recently.

I know that @fruch wants proof of the root cause, but I think we both know exactly where the bottleneck is.

We need to implement "client-side filtering" and possibly improve or remove some of the regexes used for creating events.
We should also open issues in Scylla to reduce the severity of some of the current ERRORs so we won't process them as events.
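
A minimal sketch of the kind of client-side filtering being proposed, assuming lines are pre-filtered on the DB node before they ever reach the runner's event regexes; the severity markers and function name are hypothetical, not SCT's actual configuration:

```python
# Hypothetical client-side filter: drop low-severity journal lines on the DB
# node so the runner only regex-scans lines that can actually become events.
import re

# Severity tokens assumed to look like the "!ERR" marker in the log line
# quoted in the issue description; the list below is illustrative only.
DROP_SEVERITIES = re.compile(r"!(?:DEBUG|TRACE|INFO)\b")

def client_side_filter(raw_lines):
    """Yield only the lines worth shipping to the runner."""
    for line in raw_lines:
        if DROP_SEVERITIES.search(line):
            continue  # dropped on the client; the runner never sees it
        yield line
```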

@fruch (Contributor) commented Jan 8, 2024

> This is not a new issue, but it has become quite severe recently.
>
> I know that @fruch wants proof of the root cause, but I think we both know exactly where the bottleneck is.
>
> We need to implement "client-side filtering" and possibly improve or remove some of the regexes used for creating events. We should also open issues in Scylla to reduce the severity of some of the current ERRORs so we won't process them as events.

I don't think that would be enough.

We also need a way to get metrics out of the runner node, to help us figure out where and when we hit those bottlenecks; client-side filtering alone isn't going to solve it all.
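
A minimal sketch of what pulling metrics out of the runner could look like, assuming psutil is available there; the logger name and sampling period are arbitrary choices for illustration:

```python
# Hypothetical runner-side metrics sampler: log CPU, memory and load so lag
# investigations have something to correlate log-processing delays against.
import logging
import time

import psutil

LOGGER = logging.getLogger("runner-metrics")

def sample_runner_metrics(period_sec: float = 30.0):
    while True:
        cpu = psutil.cpu_percent(interval=1)       # % CPU over a 1s window
        mem = psutil.virtual_memory().percent      # % of RAM in use
        load1, _, _ = psutil.getloadavg()          # 1-minute load average
        LOGGER.info("runner cpu=%.1f%% mem=%.1f%% load1=%.2f", cpu, mem, load1)
        time.sleep(period_sec)
```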

@roydahan (Contributor) commented Jan 8, 2024

Maybe, but I'm sure this is where we should start.
I don't need the metrics to know it's related to that.

@fruch (Contributor) commented Jan 18, 2024

I've started trying out the client-side filtering:

  • feature(syslogng): move all default filters to client side #7041

  • we still need to review the whole list of DB events we have; some of them might be able to move into the client-side filter

  • I think we should consider not generating events for warnings at all; we have quite a lot of them in some cases, and we don't really use or need them at run-time. Any warning-related handling of DB logs should be post-processing. We introduced it only to make sure we are not generating errors out of warning messages (regardless of their text).

  • reading multi-line backtraces requires the quite wide regexes we are running; if all backtraces were one-liners, we could identify them much quicker, since we wouldn't have to run those regexes on every single line of the log (see the sketch after this list)
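
A minimal sketch of the backtrace point above: fold continuation lines into the preceding message so the wide backtrace regexes run once per logical message instead of once per physical line. The continuation pattern here is an assumption for illustration, not Scylla's actual backtrace format:

```python
# Hypothetical backtrace folding: glue continuation lines (addresses, frames)
# onto the message that started them, yielding one logical line per message.
import re

# Assumed shape of backtrace continuation lines -- adjust to the real format.
CONTINUATION = re.compile(r"^\s+(?:0x[0-9a-f]+|at |/opt/|/lib/)")

def fold_backtraces(lines):
    buffered = None
    for line in lines:
        stripped = line.rstrip("\n")
        if buffered is not None and CONTINUATION.match(stripped):
            buffered += " | " + stripped.strip()  # append the frame to the message
            continue
        if buffered is not None:
            yield buffered
        buffered = stripped
    if buffered is not None:
        yield buffered
```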

One more thing we can evaluate is transferring all logs to Loki and using its APIs to filter/tail the things we are looking for.
That would offload all of the work to the monitor instance. I'll open a task with the details (and some possible steps); it doesn't contradict client-side filtering, nor getting rid of old event filters we don't need anymore.
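
A minimal sketch of the Loki idea, using Loki's query_range HTTP API; the "node" label and the "!ERR" line filter are assumptions about how the logs would be ingested, not an existing SCT integration:

```python
# Hypothetical Loki query: let the monitor instance do the filtering and only
# pull back error lines, instead of regex-scanning every line on the runner.
import requests

def fetch_error_lines(loki_url: str, node: str, start_ns: int, end_ns: int):
    """Return error-level log lines for one node via Loki's query_range API."""
    params = {
        # LogQL: stream selector on an assumed "node" label + a line filter.
        "query": f'{{node="{node}"}} |= "!ERR"',
        "start": start_ns,
        "end": end_ns,
        "limit": 1000,
        "direction": "forward",
    }
    resp = requests.get(f"{loki_url}/loki/api/v1/query_range", params=params, timeout=30)
    resp.raise_for_status()
    streams = resp.json()["data"]["result"]
    return [line for stream in streams for _ts, line in stream["values"]]
```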

@fruch (Contributor) commented Jan 21, 2024

Now that we have the runner metrics, we can clearly see the runner is at 100% CPU for long stretches, which is causing this lag.

[screenshot: runner metrics showing sustained 100% CPU]

This is not something new, but we can now try bigger instances as a temporary workaround, test the client-side filtering options, and see how much that lowers the load (if it's the root cause, and not just a side effect).

@fruch (Contributor) commented Jan 30, 2024

The major cause of the lag seems to be coming from long log lines, and it was addressed in:
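
(Not the fix referenced above, just an illustration of the long-line problem: a minimal sketch of capping line length before the event regexes run, with an arbitrary 8 KiB cap.)

```python
# Hypothetical guard: truncate pathologically long log lines before they hit
# the event regexes, so a single huge line cannot stall the processing loop.
MAX_LINE_LEN = 8 * 1024  # arbitrary cap for illustration

def cap_line(line: str) -> str:
    if len(line) <= MAX_LINE_LEN:
        return line
    # Keep the head of the line; the timestamp, severity and message live there.
    return line[:MAX_LINE_LEN] + " ...[truncated]"
```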

@fruch (Contributor) commented Jan 30, 2024

One more PR to help pinpoint log-related issues:

@fruch (Contributor) commented Jan 30, 2024

Still considering whether client-side filtering is helpful; the stats show it as having less impact on runner CPU.

@fruch (Contributor) commented Feb 12, 2024

#7128 solves the major part of this issue.
Closing for now, until we hit it again.

fruch closed this as completed Feb 12, 2024