installation of new cluster doesn't complete #34
Comments
The first thing to be aware of is that writes to the log file seem to be buffered at times, so the last thing printed there is not necessarily the last task that ran. The processes running there all make sense, and I don't see any that would be likely to cause problems. My two ideas for debugging it are:
To run Ansible manually, sudo to root and, from root's home directory, run:
I have made some changes to the Ansible in the last few days, but the tests I've run on Google and Oracle have worked without issue.
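Since the log can lag behind what Ansible is actually doing, one rough way to tell "still working" from "stuck" is to check how long ago the log file was last written. The following is only a minimal sketch: the helper names and the 30-minute threshold are assumptions made for illustration, and a buffered log can look idle even while a long task is still running, so treat the result as a hint to cross-check against the process list.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_stalled(path, max_idle_seconds=1800, now=None):
    """Return True if the file has not been written for longer than the threshold.

    The 30-minute default is an arbitrary assumption, not a project setting.
    """
    return seconds_since_modified(path, now) > max_idle_seconds


if __name__ == "__main__":
    # Log path taken from the issue report; reading it normally requires root.
    log_path = "/root/ansible-pull.log"
    if os.path.exists(log_path):
        idle = seconds_since_modified(log_path)
        state = "possibly stalled" if looks_stalled(log_path) else "recently active"
        print(f"{log_path}: last written {idle / 60:.0f} min ago ({state})")
    else:
        print(f"{log_path} not found")
```

Running this as root on the management node prints the idle time of the log; an idle time of an hour or more, as in the original report, suggests a real hang rather than buffering.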
Thanks a lot for the quick feedback! I checked with I restarted the Ansible playbook, and it's definitely progressing now; it's currently building the initial compute node image:
I also see that the packer instance was started. If I check where it was hanging previously, it seems like it didn't manage to get past the
That is indeed strange that it would hang on calling
All good now... Any suggestions on how to debug this further if it occurs again?
One thing I've seen mentioned in my searches is the issue of available RAM. It's worth checking whether the node is running low on memory. If it is, we might need to bump the instance type up a notch.
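As a rough way to do that check on the management node, the sketch below reads `/proc/meminfo` (a standard Linux interface) and flags the node when available memory drops below a fraction of the total. The function names and the 10% threshold are assumptions chosen for illustration, not anything the project defines.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def low_memory(meminfo, threshold_fraction=0.10):
    """Return True if available memory is below the given fraction of total.

    Falls back to MemFree on old kernels that do not report MemAvailable.
    """
    total = meminfo["MemTotal"]
    available = meminfo.get("MemAvailable", meminfo.get("MemFree", 0))
    return available < threshold_fraction * total


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        available_mib = info.get("MemAvailable", info.get("MemFree", 0)) // 1024
        print(f"available: {available_mib} MiB, low memory: {low_memory(info)}")
```

If this reports low memory while the installation appears stuck, that would support trying a larger instance type.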
I'll close this for now; if it happens again, I'll get back to it...
Opening this again, since the problem seems persistent... I've started 3 clusters today, all showed the same problem: installation "hangs" at (or right after) " Last syslog entries related to
The @milliams Any idea how I can check whether it actually completed the
I've made two attempts this afternoon to create a new CitC on AWS using the one-click installer, but for some reason the installation "hangs".
The management node is being created, and I can SSH into it, but the `finish` command keeps producing this (with or without a `limits.yaml` file):
The last part in `/root/ansible-pull.log` is this:
That was over 1 hour ago, no progress since then...
`/var/log/slurm` exists, but it is entirely empty.
Running processes:
Any suggestions on how to figure out what went wrong?