Job Log failures analysis
- Open the job log with less latest/job.log
- For large log files, to reduce noise, you can open the file with:
'grep -v "compaction -" job.log | grep -v "LeveledManifest -" | grep -v "Timed out waiting for server response" | grep -v "stream_session -" | grep -v "range_streamer -" | less'
- If you also want to filter out the c-s (cassandra-stress) output, add (see the combined command below):
grep -v "0, 0,"
- Skip to the end of a big file with G (press Ctrl+C if less gets stuck calculating line numbers).
- Search upward (? in less, or N to reverse the direction of the last search) for the keyword 'Cleaning up'; the main failure reason (backtrace) is usually found there.
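For example, to land near the failure summary quickly:
# open the log at its end, then search backward inside less
less +G latest/job.log
# inside less, type  ?Cleaning up  and press Enter; use n/N to step through matches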
In many cases something bad happens during one nemesis and the test proceeds to the next one; as soon as the test reaches the next "harmful" nemesis, the running load may crash. In such cases we need to look for the last good state of the cluster, and then find the nemesis that first put the cluster into the degraded state. For example, if a nemesis like "TerminateAndReplace" failed to add a replacement node, the cluster will have one node in "DN" state. Search for the last good state of that node (e.g. search for " UN ip_address" from the end of the file), and then search for the nemesis that took place after this known good state (by searching for "Set current"). Another good option is to search for "Traceback" or "traceback", which shows Python exceptions raised during the test.
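A minimal sketch of that search, assuming the degraded node's IP is 10.0.1.1 (a placeholder):
# last time the node was still reported as UN (10.0.1.1 is a placeholder IP)
grep -n ' UN 10.0.1.1' latest/job.log | tail -1
# nemesis switches; compare the timestamps/line numbers with the line above
grep -n 'Set current' latest/job.log
# python exceptions raised during the test
grep -n -i 'traceback' latest/job.log | less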
less latest/job.log|grep -i 'Found coredump'
less latest/job.log |grep -i 'Reactor stall'
There is a tool from Yulia that analyzes the backtraces and generates a decoded, unique stall backtrace. It is very helpful for huge backtrace messages: https://github.com/scylladb/scylla-cluster-tests/pull/751
# Number of reactor stalls
less latest/job.log | grep 'Reactor stall' | wc -l
# Total number of backtraces (reactor stalls and others)
less latest/job.log | grep 'Backtrace:' | wc -l
# Number of backtraces other than `Reactor stall`
less latest/job.log | grep 'Backtrace:' | grep -v 'Reactor stall' | wc -l
Useful keywords to search for in the log (an example follows the list):
- 'nodes ready': how many nodes are ready
- 'SSH access': get instance ip
- 'Node setup failed': node setup failed
- 'Destroy nodes'
- 'scylla_setup': scylla setup
- 'Scylla version': get version string
- ':9042': CQL server starts
- ':3000': monitor related urls
- 'Try to start prometheus API server on port': prometheus API server port
- 'I/O Scheduler is not properly configured!'
- 'Failed to start Scylla Server'
- 'Failed to write into /proc/irq'
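For example, to follow a node's bring-up milestones with a few of the keywords above:
# setup, version and CQL-ready markers in one pass
grep -E 'nodes ready|scylla_setup|Scylla version|:9042' latest/job.log | less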
less latest/job.log |grep -i exception |less
less latest/job.log |grep -i error |less
less latest/job.log |grep -i fail |less
- exception
- error
- fail
- killed
- END$
- FAILURE$
- 'Cannot find summary in c-stress results'
- 'Stress script content:'
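As a quick triage, you can count how often each keyword appears (a sketch; extend the keyword list as needed):
# occurrences of each failure-related keyword in the job log
for kw in exception error fail killed 'END$' 'FAILURE$'; do
    printf '%-10s %s\n' "$kw" "$(grep -c -i -E "$kw" latest/job.log)"
done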
- 'TAG: loader_idx:': multiple workloads info (loader/cpu/keyspace idx)
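For example, to see which loader/cpu/keyspace a stress message came from:
grep 'TAG: loader_idx:' latest/job.log | less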
By referencing the log timestamps, you can tell around which nemesis (and at which step) an error occurred:
$ less latest/job.log|egrep '(<<<<|>>>>)'
2019-01-21 12:01:53,677 nemesis L0395 INFO | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method terminate_and_replace_node
2019-01-21 12:08:10,801 nemesis L0399 INFO | sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method terminate_and_replace_node
2019-01-21 12:14:08,016 nemesis L0395 INFO | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method nodetool_refresh
2019-01-21 12:20:14,376 nemesis L0395 INFO | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method modify_table
....
For more output (also matching any line that mentions a nemesis):
$ less latest/job.log|egrep -i '(<<<<|>>>>|nemesis)'
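To line errors up against the nemesis windows, you can combine both greps (a sketch; it assumes the log marks error lines with ERROR):
# nemesis start/finish markers interleaved with error lines (ERROR level is an assumption)
less latest/job.log | egrep '>>>>|<<<<|ERROR'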
- '.png': grafana snapshot url
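For example, to list the grafana snapshot links collected during the run:
grep -F '.png' latest/job.log | less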
$ less sct.log | grep -E 'ing a Node|begin$|ended$|Run some Queries|Populate DB with'
Branches that we keep sending backports to from master:
- branch-4.0
- branch-4.1
- branch-2019.1
- branch-perf-v9
- manager-2.1
Branches that we send only critical backports to from master:
- branch-3.3
- branch-2019.1
- manager-2.0
Branches that we still have jobs for and still use, but that no longer get backports from master:
- manager-v1.2
- manager-v1.3
- manager-v1.3.1
- manager-v1.4
Older unused branches will be deleted on a regular basis.