
Improve monitoring stack start up time #2433

Closed
mykaul opened this issue Dec 24, 2024 · 15 comments
Labels
enhancement New feature or request

Comments

@mykaul
Contributor

mykaul commented Dec 24, 2024

Today (4.8.3), when using start-all.sh's serial activation of all the monitoring stack containers, startup takes ~1:30 min.
It could be improved.
We should look at why it takes so long (docker inspect while starting Alertmanager? Why?), or move to docker-compose based activation (is it faster? Need to measure).

@mykaul mykaul added the enhancement New feature or request label Dec 24, 2024
@amnonh
Collaborator

amnonh commented Dec 24, 2024

When using the start-all.sh script, there are a bunch of dependencies between the different components that the script resolves.
It also makes it clearer what is going on.

A faster startup (seconds vs. 1-2 minutes) can be accomplished by using docker-compose.
Docker compose is supported, and the compose file can even be created automatically using the --compose command-line argument to start-all.sh.
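For reference, that compose-based flow looks roughly like this (a sketch assuming --compose only generates the file, as discussed in this thread; the verification step is just one option):

```shell
# Generate a docker-compose file instead of starting containers serially
# (per this thread: start-all.sh --compose, or the make-compose.sh script)
./start-all.sh --compose

# Bring the whole stack up in parallel, letting compose resolve dependencies
docker compose up -d

# Check that all services are running
docker compose ps
```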

@amnonh
Collaborator

amnonh commented Dec 24, 2024

Closing as already done, please reopen if you think it is needed

@amnonh amnonh closed this as completed Dec 24, 2024
@mykaul
Contributor Author

mykaul commented Dec 24, 2024

Why NOT use docker-compose by default then? What's the downside?

@mykaul
Contributor Author

mykaul commented Dec 24, 2024

Is https://monitoring.docs.scylladb.com/stable/install/docker-compose.html#using-docker-compose outdated? It says to use either start-all.sh or docker-compose, not mentioning '--compose'

@amnonh
Collaborator

amnonh commented Dec 24, 2024

It seems that the docker-compose file example is up to date, but I do need to mention the option to have the docker-compose file created for you by the start-all.sh or make-compose.sh scripts.

@nyh
Contributor

nyh commented Dec 24, 2024

Closing as already done, please reopen if you think it is needed

What do you mean "already done"? @mykaul measured 1:30 minutes. If it's still 1:30 minutes, it's not done... If there's a better way for it to take a few seconds instead of 1:30 minutes (using a different docker setup, not using docker at all, or I don't know what) then shouldn't this issue remain open until this faster way becomes the only way or at least default way?

@amnonh
Collaborator

amnonh commented Dec 24, 2024

There are currently two options for starting the monitoring stack. There are pros and cons to both options. The field asked that we keep the original one as the default.

QA can change the way they start the monitoring stack. For users, 1.5 minutes makes no difference.

@tzach
Contributor

tzach commented Dec 25, 2024

A faster (seconds vs 1-2 minutes) can be accomplished by using docker-compose.

Why? Both have the same dependencies.

The straightforward way would be to use more parallelism in the boot (start all the containers at the same time).
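A sketch of that parallel boot — `start_service` here is a placeholder stand-in for whatever `docker run` invocation each component needs, not the actual start-all.sh code:

```shell
# Placeholder: in the real script this would be the per-component docker run
start_service() { echo "started $1"; }

# Start every component concurrently; total boot time approaches the
# slowest single service instead of the sum of all of them.
parallel_start() {
  local svc
  for svc in "$@"; do
    start_service "$svc" &
  done
  wait                      # block until every background start has finished
}

# parallel_start prometheus alertmanager grafana loki promtail
```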

@mykaul
Contributor Author

mykaul commented Dec 25, 2024

It seems that the docker-compose file example is up to date, but I do need to mention the option to have the docker-compose file created for you by the start-all.sh or make-compose.sh scripts.

The versions, at least, are not the latest:
PROMETHEUS_VERSION=v2.54.1 - vs. image: prom/prometheus:v2.51.1
GRAFANA_VERSION=11.3.0 - vs. image: grafana/grafana:10.4.1

etc.

(The top-level 'version' key is obsolete and no longer needed.)

@amnonh
Collaborator

amnonh commented Dec 25, 2024

I've added a new option for a quicker startup time: #2436. In the new method, the script will not validate each of the processes as it does today. This is not the suggested way for users, but for the cloud (or QA) it could be helpful.

When testing locally, the startup time was reduced from 45s to 1.5s

@nyh
Contributor

nyh commented Dec 26, 2024

Speeding up an interactive script from taking 45 seconds to 1.5 seconds is a fantastic improvement for user experience. What sort of "validate each of the processes" takes the extra 43 seconds? Couldn't we do this validation 100 times faster?
(I don't know anything about the details of how this stuff works, but I wouldn't give up the ability to run something in 1.5 seconds if we can).

@mykaul
Contributor Author

mykaul commented Dec 26, 2024

Speeding up an interactive script from taking 45 seconds to 1.5 seconds is a fantastic improvement for user experience. What sort of "validate each of the processes" takes the extra 43 seconds? Couldn't we do this validation 100 times faster? (I don't know anything about the details of how this stuff works, but I wouldn't give up the ability to run something in 1.5 seconds if we can).

We've waited for each container to respond to 'curl'. Imagine it didn't on the first attempt - which was right after 'docker run' - so curl got connection refused (or worse, timed out?) and waited 5 seconds. That's for each container separately. We have a few of those (alertmanager, loki, promtail, grafana, prometheus) - so ~25 seconds, I assume?
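The wait pattern described here can be sketched as a generic helper (an assumed shape for illustration — not the actual start-all.sh code; the health URLs in the comments are the components' conventional endpoints):

```shell
# wait_ready: retry a check command with a fixed sleep between attempts.
# With a 5-second delay, one slow (or not-yet-listening) service adds
# multiples of 5s; five services polled serially add up to ~25s or more.
wait_ready() {              # wait_ready <sleep-seconds> <max-tries> <check-cmd...>
  local delay=$1 tries=$2 i
  shift 2
  for ((i = 0; i < tries; i++)); do
    "$@" >/dev/null 2>&1 && return 0
    sleep "$delay"
  done
  return 1
}

# Serial startup: each service blocks the next one, e.g.
#   docker run -d ... prom/prometheus
#   wait_ready 5 12 curl -sf http://localhost:9090/-/ready
#   docker run -d ... prom/alertmanager
#   wait_ready 5 12 curl -sf http://localhost:9093/-/ready
```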

@nyh
Contributor

nyh commented Dec 26, 2024

@mykaul exactly. I think it's a mistake to have two options 1. fast and doesn't wait at all but not recommended, 2. slow and waits for ridiculous amounts of time needlessly. As I suggested in my review, there is a third option that should probably become the default:

Start all the processes in the background, and only then wait for all of them to complete. If in practice they come up after 0.5 seconds, then add 0.1 second sleeps - not 5-second sleeps - those intra-node curl tests are basically free.

My suggestion won't work if one of these servers (e.g., prometheus) genuinely takes a long time to start up. In that case I think that printouts will go a long way to making the user experience more pleasant - you'll be told that the servers were started but now we're waiting for them to come up.
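A minimal sketch of that third option, assuming each service exposes an HTTP health endpoint (the helper name and URLs are illustrative, not taken from start-all.sh):

```shell
# Phase 1: start every container without waiting, e.g.
#   for svc in prometheus alertmanager grafana loki promtail; do
#     docker run -d ... "$svc"
#   done

# Phase 2: poll each health endpoint with short sleeps; local curl
# checks are basically free, so 0.1-second retries cost almost nothing.
poll_until() {              # poll_until <max-tries> <check-cmd...>
  local tries=$1 i
  shift
  for ((i = 0; i < tries; i++)); do
    "$@" >/dev/null 2>&1 && return 0
    sleep 0.1
  done
  echo "service did not come up in time" >&2   # keep the user informed
  return 1
}

# poll_until 300 curl -sf http://localhost:9090/-/ready   # ~30s budget
```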

@mykaul
Contributor Author

mykaul commented Dec 26, 2024

@mykaul exactly. I think it's a mistake to have two options 1. fast and doesn't wait at all but not recommended, 2. slow and waits for ridiculous amounts of time needlessly. As I suggested in my review, there is a third option that should probably become the default:

Start all the processes in the background, and only then wait for all of them to complete. If in practice they come up after 0.5 seconds, then add 0.1 second sleeps - not 5-second sleeps - those intra-node curl tests are basically free.

My suggestion won't work if one of these servers (e.g., prometheus) genuinely takes a long time to start up. In that case I think that printouts will go a long way to making the user experience more pleasant - you'll be told that the servers were started but now we're waiting for them to come up.

I think switching the order between 'docker ps' and 'curl' - first check that the container is running, then let 'curl' do its tests with a 1-second timeout (and more retries) - should be good enough to cut most of the time spent.
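That reordering could look roughly like this — a hypothetical helper, not the actual start-all.sh validation (note: curl's --retry-all-errors, which also retries on connection refused, needs curl >= 7.71):

```shell
container_ready() {         # container_ready <container-name> <health-url>
  local name=$1 url=$2
  # Cheap check first: is the container actually running?
  [ "$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)" = true ] || return 1
  # Only then probe the service: 1-second timeout per attempt, quick retries
  curl -sf --max-time 1 --retry 10 --retry-delay 1 --retry-all-errors "$url" >/dev/null
}

# container_ready alertmanager http://localhost:9093/-/ready
```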

4 participants