
Improve monitoring stack start up time #2433

Closed
mykaul opened this issue Dec 24, 2024 · 15 comments
Labels
enhancement New feature or request

Comments

@mykaul
Contributor

mykaul commented Dec 24, 2024

Today (4.8.3), when using start-all.sh's serial activation of all the monitoring stack containers, startup takes ~1:30 min.
It could be improved.
We should look at why it takes so long (docker inspect while starting Alertmanager? Why?), or move to docker-compose based activation (is it faster? Need to measure).

@mykaul mykaul added the enhancement New feature or request label Dec 24, 2024
@amnonh
Collaborator

amnonh commented Dec 24, 2024

When using the start-all.sh script, there are a bunch of dependencies between the different components that the script resolves.
It also makes it clearer what is going on.

A faster startup (seconds vs. 1-2 minutes) can be accomplished by using docker-compose.
Docker compose is supported, and the compose file can even be created automatically using the --compose command-line argument to start-all.sh.
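For reference, that compose-based flow looks roughly like this (a sketch assuming --compose only generates the file, as discussed in this thread; the verification step is just one option):

```shell
# Generate a docker-compose file instead of starting containers serially
# (per this thread: start-all.sh --compose, or the make-compose.sh script)
./start-all.sh --compose

# Bring the whole stack up in parallel, letting compose resolve dependencies
docker compose up -d

# Check that all services are running
docker compose ps
```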

@amnonh
Collaborator

amnonh commented Dec 24, 2024

Closing as already done, please reopen if you think it is needed

@amnonh amnonh closed this as completed Dec 24, 2024
@mykaul
Contributor Author

mykaul commented Dec 24, 2024

Why NOT use docker-compose by default then? What's the downside?

@mykaul
Contributor Author

mykaul commented Dec 24, 2024

Is https://monitoring.docs.scylladb.com/stable/install/docker-compose.html#using-docker-compose outdated? It says to use either start-all.sh or docker-compose, not mentioning '--compose'

@amnonh
Collaborator

amnonh commented Dec 24, 2024

It seems that the docker-compose file example is up to date, but I do need to mention the option to have the docker-compose file created for you by the start-all.sh or make-compose.sh scripts.

@nyh
Contributor

nyh commented Dec 24, 2024

Closing as already done, please reopen if you think it is needed

What do you mean "already done"? @mykaul measured 1:30 minutes. If it's still 1:30 minutes, it's not done... If there's a better way for it to take a few seconds instead of 1:30 minutes (using a different docker setup, not using docker at all, or I don't know what) then shouldn't this issue remain open until this faster way becomes the only way or at least default way?

@amnonh
Collaborator

amnonh commented Dec 24, 2024

There are currently two options for starting the monitoring stack. There are pros and cons to both options. The field asked that we keep the original one as the default.

QA can change the way they start the monitoring stack. For users, 1.5 minutes makes no difference.

@tzach
Contributor

tzach commented Dec 25, 2024

A faster (seconds vs 1-2 minutes) can be accomplished by using docker-compose.

Why? Both have the same dependencies.

The straightforward way would be to use more parallelism in the boot (start all the containers at the same time).
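A sketch of that parallel boot — `start_service` here is a placeholder stand-in for whatever `docker run` invocation each component needs, not the actual start-all.sh code:

```shell
# Placeholder: in the real script this would be the per-component docker run
start_service() { echo "started $1"; }

# Start every component concurrently; total boot time approaches the
# slowest single service instead of the sum of all of them.
parallel_start() {
  local svc
  for svc in "$@"; do
    start_service "$svc" &
  done
  wait                      # block until every background start has finished
}

# parallel_start prometheus alertmanager grafana loki promtail
```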

@mykaul
Contributor Author

mykaul commented Dec 25, 2024

It seems that the docker-compose file example is up to date, but I do need to mention the option to have the docker-compose file created for you by the start-all.sh or make-compose.sh scripts.

The versions, at least, are not the latest:
PROMETHEUS_VERSION=v2.54.1 - vs. image: prom/prometheus:v2.51.1
GRAFANA_VERSION=11.3.0 - vs. image: grafana/grafana:10.4.1

etc.

(The top-level 'version' key is obsolete and no longer needed.)

@amnonh
Collaborator

amnonh commented Dec 25, 2024

I've added a new option for a quicker startup time: #2436. In the new method, the script will not validate each of the processes as it does today. This is not the suggested way for users, but for the cloud (or QA) it could be helpful.

When testing locally, the startup time was reduced from 45s to 1.5s

@nyh
Contributor

nyh commented Dec 26, 2024

Speeding up an interactive script from taking 45 seconds to 1.5 seconds is a fantastic improvement for user experience. What sort of "validate each of the processes" takes the extra 43 seconds? Couldn't we do this validation 100 times faster?
(I don't know anything about the details of how this stuff works, but I wouldn't give up the ability to run something in 1.5 seconds if we can).

@mykaul
Contributor Author

mykaul commented Dec 26, 2024

Speeding up an interactive script from taking 45 seconds to 1.5 seconds is a fantastic improvement for user experience. What sort of "validate each of the processes" takes the extra 43 seconds? Couldn't we do this validation 100 times faster? (I don't know anything about the details of how this stuff works, but I wouldn't give up the ability to run something in 1.5 seconds if we can).

We've waited for each container to respond to 'curl'. Imagine it didn't on the first attempt - which was right after 'docker run' - so curl got connection refused (or worse, timed out?) and waited 5 seconds. That's for each container separately. We have a few of those (alertmanager, loki, promtail, grafana, prometheus) - so ~25 seconds, I assume?
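The wait pattern described here can be sketched as a generic helper (an assumed shape for illustration — not the actual start-all.sh code; the health URLs in the comments are the components' conventional endpoints):

```shell
# wait_ready: retry a check command with a fixed sleep between attempts.
# With a 5-second delay, one slow (or not-yet-listening) service adds
# multiples of 5s; five services polled serially add up to ~25s or more.
wait_ready() {              # wait_ready <sleep-seconds> <max-tries> <check-cmd...>
  local delay=$1 tries=$2 i
  shift 2
  for ((i = 0; i < tries; i++)); do
    "$@" >/dev/null 2>&1 && return 0
    sleep "$delay"
  done
  return 1
}

# Serial startup: each service blocks the next one, e.g.
#   docker run -d ... prom/prometheus
#   wait_ready 5 12 curl -sf http://localhost:9090/-/ready
#   docker run -d ... prom/alertmanager
#   wait_ready 5 12 curl -sf http://localhost:9093/-/ready
```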

@nyh
Contributor

nyh commented Dec 26, 2024

@mykaul exactly. I think it's a mistake to have two options 1. fast and doesn't wait at all but not recommended, 2. slow and waits for ridiculous amounts of time needlessly. As I suggested in my review, there is a third option that should probably become the default:

Start all the processes in the background, and only then wait for all of them to complete. If in practice they come up after 0.5 seconds, then add 0.1 second sleeps - not 5-second sleeps - those intra-node curl tests are basically free.

My suggestion won't work if one of these servers (e.g., prometheus) genuinely takes a long time to start up. In that case I think that printouts will go a long way to making the user experience more pleasant - you'll be told that the servers were started but now we're waiting for them to come up.
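A minimal sketch of that third option, assuming each service exposes an HTTP health endpoint (the helper name and URLs are illustrative, not taken from start-all.sh):

```shell
# Phase 1: start every container without waiting, e.g.
#   for svc in prometheus alertmanager grafana loki promtail; do
#     docker run -d ... "$svc"
#   done

# Phase 2: poll each health endpoint with short sleeps; local curl
# checks are basically free, so 0.1-second retries cost almost nothing.
poll_until() {              # poll_until <max-tries> <check-cmd...>
  local tries=$1 i
  shift
  for ((i = 0; i < tries; i++)); do
    "$@" >/dev/null 2>&1 && return 0
    sleep 0.1
  done
  echo "service did not come up in time" >&2   # keep the user informed
  return 1
}

# poll_until 300 curl -sf http://localhost:9090/-/ready   # ~30s budget
```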

@mykaul
Contributor Author

mykaul commented Dec 26, 2024

@mykaul exactly. I think it's a mistake to have two options 1. fast and doesn't wait at all but not recommended, 2. slow and waits for ridiculous amounts of time needlessly. As I suggested in my review, there is a third option that should probably become the default:

Start all the processes in the background, and only then wait for all of them to complete. If in practice they come up after 0.5 seconds, then add 0.1 second sleeps - not 5-second sleeps - those intra-node curl tests are basically free.

My suggestion won't work if one of these servers (e.g., prometheus) genuinely takes a long time to start up. In that case I think that printouts will go a long way to making the user experience more pleasant - you'll be told that the servers were started but now we're waiting for them to come up.

I think switching the order between 'docker ps' and 'curl' - first check that the container is running, then let 'curl' do its tests with a 1-second timeout (and more retries) - should be good enough to cut most of the time spent.
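That reordering could look roughly like this — a hypothetical helper, not the actual start-all.sh validation (note: curl's --retry-all-errors, which also retries on connection refused, needs curl >= 7.71):

```shell
container_ready() {         # container_ready <container-name> <health-url>
  local name=$1 url=$2
  # Cheap check first: is the container actually running?
  [ "$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)" = true ] || return 1
  # Only then probe the service: 1-second timeout per attempt, quick retries
  curl -sf --max-time 1 --retry 10 --retry-delay 1 --retry-all-errors "$url" >/dev/null
}

# container_ready alertmanager http://localhost:9093/-/ready
```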

4 participants