Improve monitoring stack start up time #2433
Comments
When using the [...] A faster startup (seconds vs. 1-2 minutes) can be accomplished by using docker-compose. |
Closing as already done, please reopen if you think it is needed |
Why NOT use docker-compose by default then? What's the downside? |
Is https://monitoring.docs.scylladb.com/stable/install/docker-compose.html#using-docker-compose outdated? It says to use either start-all.sh or docker-compose, not mentioning '--compose' |
Is https://monitoring.docs.scylladb.com/stable/install/docker-compose.html#docker-compose-file containing an up-to-date definition? |
It seems that the docker-compose file example is up to date, but I do need to mention the option to create the docker-compose file for you using the |
What do you mean "already done"? @mykaul measured 1:30 minutes. If it's still 1:30 minutes, it's not done... If there's a better way for it to take a few seconds instead of 1:30 minutes (using a different docker setup, not using docker at all, or I don't know what) then shouldn't this issue remain open until this faster way becomes the only way or at least default way? |
There are currently two options for starting the monitoring stack, and each has pros and cons. The field asked that we keep the original one as the default. QA can change the way they start the monitoring stack. For users, 1.5 minutes makes no difference. |
Why? Both have the same dependencies. The straightforward way would be to use more parallelism in the boot (start all dockers at the same time). |
The versions, at least, are not the latest, etc. ('version' is outdated and not needed) |
I've added a new option for a quicker startup time: #2436. In the new method, the script will not validate each of the processes as it does today. This is not the suggested way for users, but for the cloud (or QA) it could be helpful. When testing locally, the startup time was reduced from 45s to 1.5s. |
Speeding up an interactive script from taking 45 seconds to 1.5 seconds is a fantastic improvement for user experience. What sort of "validate each of the processes" takes the extra 43 seconds? Couldn't we do this validation 100 times faster? |
We've waited for each container to respond to 'curl'. Imagine it didn't on the first attempt, which was right after 'docker run', so curl got connection refused (or worse, timed out?) and waited 5 seconds. That's for each container separately. We have a few of those (alertmanager, loki, promtail, grafana, prometheus), so ~25 seconds, I assume? |
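A rough sketch of the arithmetic in the comment above. The container list and the 5-second sleep come from that comment; the assumption that every container fails exactly one curl attempt is a guess for illustration, not a measurement.

```python
# Containers named in the comment; each is waited on one after another.
CONTAINERS = ["alertmanager", "loki", "promtail", "grafana", "prometheus"]
RETRY_SLEEP_S = 5  # sleep after a failed curl attempt, per the comment


def serial_wait_estimate(containers, failed_attempts_each=1, retry_sleep_s=RETRY_SLEEP_S):
    """Total sleep time in seconds when containers are checked serially.

    failed_attempts_each is an assumption: how many curl attempts fail
    before each container answers.
    """
    return len(containers) * failed_attempts_each * retry_sleep_s


if __name__ == "__main__":
    print(serial_wait_estimate(CONTAINERS))  # 5 containers * 1 failure * 5 s = 25
```

With two failed attempts per container, the same model gives 50 seconds, so the total depends heavily on how early the first probe fires.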
@mykaul exactly. I think it's a mistake to have two options: 1. fast and doesn't wait at all, but not recommended; 2. slow and waits for ridiculous amounts of time needlessly. As I suggested in my review, there is a third option that should probably become the default: start all the processes in the background, and only then wait for all of them to come up. If in practice they come up after 0.5 seconds, then add 0.1-second sleeps, not 5-second sleeps; those intra-node curl tests are basically free. My suggestion won't work if one of these servers (e.g., prometheus) genuinely takes a long time to start up. In that case I think that printouts will go a long way toward making the user experience more pleasant: you'll be told that the servers were started but now we're waiting for them to come up. |
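A minimal sketch of the "start everything, then wait for all of them" idea from the comment above. It is not the project's script: `wait_for_all` and `ready_after` are hypothetical names, and the readiness checks are simulated stand-ins for the real curl probes. The 0.1-second interval follows the comment; the timeout value is an assumption.

```python
import time


def wait_for_all(checks, interval_s=0.1, timeout_s=30.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll every pending check until all pass or the deadline expires.

    checks maps a service name to a zero-argument callable that returns
    True once the service is ready. Returns the set of names not ready.
    """
    pending = dict(checks)
    deadline = clock() + timeout_s
    while pending:
        # Probe every pending service in one pass, so one slow service
        # does not delay noticing that the others are already up.
        for name in [n for n, is_ready in pending.items() if is_ready()]:
            print(f"{name} is up")
            del pending[name]
        if not pending or clock() >= deadline:
            break
        sleep(interval_s)
    return set(pending)


def ready_after(polls):
    """Hypothetical stand-in for a curl probe: succeeds on the Nth call."""
    state = {"calls": 0}

    def probe():
        state["calls"] += 1
        return state["calls"] >= polls

    return probe


if __name__ == "__main__":
    services = {
        "prometheus": ready_after(3),
        "grafana": ready_after(1),
        "loki": ready_after(2),
    }
    print("started all services, waiting for them to come up")
    not_ready = wait_for_all(services, interval_s=0.1, timeout_s=5.0)
    print("still waiting on:", sorted(not_ready))
```

Because all services are polled in the same loop, the total wait is roughly the startup time of the slowest one rather than the sum of all of them, and the per-service printouts give the progress feedback the comment asks for.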
I think switching the order between 'docker ps' and 'curl' (first check that the container is running, then let 'curl' do its tests with a 1-second timeout and more retries) should be good enough to cut most of the time spent. |
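One way the reordered check might look, as a sketch only. It uses `docker inspect` rather than `docker ps` for the running check and Python's `urllib` rather than curl for the probe; both are substitutions for illustration. The function names, retry count, and pause are assumptions, and only the 1-second timeout comes from the comment.

```python
import subprocess
import time
import urllib.error
import urllib.request


def container_running(name):
    """True if `docker inspect` reports the container as running."""
    try:
        result = subprocess.run(
            ["docker", "inspect", "-f", "{{.State.Running}}", name],
            capture_output=True, text=True,
        )
    except FileNotFoundError:  # docker CLI is not installed
        return False
    return result.returncode == 0 and result.stdout.strip() == "true"


def http_ok(url, timeout_s=1.0):
    """True if the URL answers with a 2xx status within timeout_s seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_ready(name, url, retries=20, pause_s=0.2,
               is_running=container_running, probe=http_ok, sleep=time.sleep):
    """Fail fast if the container is not running, else retry a short probe."""
    if not is_running(name):
        return False  # no point probing a container that never started
    for _ in range(retries):
        if probe(url):
            return True
        sleep(pause_s)
    return False
```

Checking the container state first means a container that failed to start is reported immediately instead of after a full round of probe timeouts, and the short per-attempt timeout keeps each retry cheap.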
Today (4.8.3) when using the start-all serial activation of all monitoring stack containers, it takes ~1:30m.
It could be improved.
We should look at why it takes so long (docker inspect in starting alert manager? Why?), or move to docker-compose based activation (is it faster? need to measure).