installation of new cluster doesn't complete #34

Open
boegel opened this issue Feb 19, 2021 · 7 comments
Labels: AWS, bug (Something isn't working)

Comments


boegel commented Feb 19, 2021

I've made two attempts this afternoon to create a new CitC cluster on AWS using the one-click installer, but for some reason the installation "hangs".

The management node is created, and I can SSH into it, but the finish command keeps producing this (with or without a limits.yaml file):

[citc@mgmt ~]$ finish
Error: The management node has not finished its setup
Please allow it to finish before continuing.
For information about why they have not finished, check the file /root/ansible-pull.log

The last part in /root/ansible-pull.log is this:

TASK [slurm : open all ports] **************************************************
Friday 19 February 2021  14:19:11 +0000 (0:00:00.045)       0:06:17.021 *******

That was over an hour ago; no progress since then...

/var/log/slurm exists, but it is entirely empty.

Running processes:

root        1515  0.0  1.0 372592 40816 ?        Ss   14:12   0:00 /usr/libexec/platform-python /usr/bin/cloud-init modules --mode=final
root        1997  0.0  0.0 217052   732 ?        S    14:12   0:00  \_ tee -a /var/log/cloud-init-output.log
root        2037  0.0  0.0 235744  3412 ?        S    14:12   0:00  \_ /bin/bash /var/lib/cloud/instance/scripts/part-001
root        4767  0.0  0.9 406240 34832 ?        S    14:12   0:00      \_ /usr/bin/python3 -u /usr/bin/ansible-pull --url=https://github.com/clusterinthecloud/ansible.git --checkout=6 --inventory=/root/hosts management.yml
root        9929  7.3  1.6 590508 61548 ?        Sl   14:12   5:24          \_ /usr/bin/python3.6 /usr/bin/ansible-playbook -c local /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/management.yml -t all -l localhost,mgmt,ip-10-0-16-0,ip-10-0-16-0.eu-west-1.com
root       27615  0.0  1.4 583004 54488 ?        S    14:19   0:00              \_ /usr/bin/python3.6 /usr/bin/ansible-playbook -c local /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/management.yml -t all -l localhost,mgmt,ip-10-0-16-0,ip-10-0-16-0.eu-west-1
root       27616  0.0  0.0 235744  3372 ?        S    14:19   0:00                  \_ /bin/sh -c /usr/libexec/platform-python && sleep 0
root       27617  0.0  0.8 415588 30484 ?        S    14:19   0:00                      \_ /usr/libexec/platform-python
dirsrv     17078  0.1  2.1 662068 81740 ?        Ssl  14:14   0:06 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-mgmt -i /run/dirsrv/slapd-mgmt.pid
citc       17138  0.0  0.2  93904  9968 ?        Ss   14:15   0:00 /usr/lib/systemd/systemd --user
citc       17142  0.0  0.1 257440  5068 ?        S    14:15   0:00  \_ (sd-pam)
mysql      21671  0.0  2.4 1776020 93568 ?       Ssl  14:15   0:01 /usr/libexec/mysqld --basedir=/usr
munge      22577  0.0  0.1 125220  4048 ?        Sl   14:17   0:00 /usr/sbin/munged
root       24674  0.0  1.0 509096 41380 ?        Ssl  14:17   0:00 /usr/libexec/platform-python -s /usr/sbin/firewalld --nofork --nopid
root       27703  0.0  0.0 232532  2036 ?        Ss   15:01   0:00 /usr/sbin/anacron -s

Any suggestions on how to figure out what went wrong?

@milliams (Member) commented:

The first thing to be aware of is that writes to the log file sometimes get buffered, so the latest thing printed in there is not necessarily the latest task that ran.

The processes running there all make sense and I don't see any that would be likely to cause problems.

My two ideas for debugging it are:

  1. check lsof to see if there's anything that gives a hint as to what is hanging (see the sketch below)
  2. kill the Ansible run and run it again manually.
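
For example, something along these lines (just a sketch; the PIDs will differ on your node — 9929 and 27617 are the ansible-playbook and platform-python processes in the tree above, and 4767 is the ansible-pull parent):

sudo lsof -p 9929      # open files/sockets of the stuck ansible-playbook
sudo lsof -p 27617     # the platform-python leaf spawned for the current task
sudo kill 4767         # stop the whole ansible-pull run before retrying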

To run Ansible manually, sudo to root and, from root's home directory, run:

/usr/bin/ansible-pull --url=https://github.com/clusterinthecloud/ansible.git --checkout=6 --inventory=/root/hosts management.yml

I have made some changes to the Ansible in the last few days, but the tests I've run on Google and Oracle have worked without issue.

milliams added the AWS and bug labels Feb 19, 2021

boegel commented Feb 19, 2021

Thanks a lot for the quick feedback!

I checked with lsof, but couldn't seem to find any clues as to what went wrong...

I restarted the Ansible playbook, and it's definitely progressing now; it's currently building the initial compute node image:

TASK [finalise : Wait for packer to finish]

I also see that the packer instance was started.

If I check where it was hanging previously, it seems like it didn't manage to get past the "slurm : open all ports" task for some reason, since it now indicates that changes were made there:

...

TASK [set slurm log directory permissions] *************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:19 +0000 (0:00:00.507)       0:00:37.562 *******
ok: [mgmt.clever-pipefish.citc.local]

TASK [set slurm spool directory permissions] ***********************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:19 +0000 (0:00:00.233)       0:00:37.795 *******
ok: [mgmt.clever-pipefish.citc.local]

TASK [set slurmd config directory permissions] *********************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:20 +0000 (0:00:00.255)       0:00:38.051 *******
skipping: [mgmt.clever-pipefish.citc.local]

TASK [slurm : open all ports] **************************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:20 +0000 (0:00:00.042)       0:00:38.093 *******
changed: [mgmt.clever-pipefish.citc.local]

TASK [slurm : include_tasks] ***************************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:20 +0000 (0:00:00.765)       0:00:38.859 *******
included: /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/roles/slurm/tasks/elastic.yml for mgmt.clever-pipefish.citc.local

TASK [slurm : install common tools] ********************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:21 +0000 (0:00:00.072)       0:00:38.932 *******
changed: [mgmt.clever-pipefish.citc.local]

...

@milliams (Member) commented:

It is indeed strange that it would hang on calling firewall-cmd, especially twice in a row. Hopefully it runs to the end now.
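
If it does hang again, a rough first check (just a sketch, run as root) is whether firewall-cmd itself is responsive and whether firewalld looks healthy around the time of the hang:

timeout 10 firewall-cmd --state                # should print "running" almost immediately
systemctl status firewalld                     # is the daemon up and active?
journalctl -u firewalld --since "1 hour ago"   # any firewalld errors near the hang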


boegel commented Feb 19, 2021

All good now...

Any suggestions on how to debug this further if it occurs again?

@milliams (Member) commented:

One thing I've seen mentioned in my searches is the issue of available RAM. It's worth checking whether the node is running low on memory; if so, we might need to bump the instance type up a notch.
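
For example (a rough sketch):

free -h                                       # current RAM and swap usage
sudo dmesg | grep -iE 'out of memory|oom'     # did the kernel OOM-killer fire?
sudo journalctl -k | grep -i oom              # same check via the journal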


boegel commented Feb 19, 2021

I'll close this for now, if it happens again I'll get back to it...

boegel closed this as completed Feb 19, 2021

boegel commented Feb 19, 2021

Reopening this, since the problem seems to be persistent...

I've started 3 clusters today, and all showed the same problem: the installation "hangs" at (or right after) "slurm : open all ports".

The last syslog entries related to Ansible:

Feb 19 19:47:39 ip-10-0-70-26 platform-python[27532]: ansible-copy Invoked with src=/root/.ansible/tmp/ansible-tmp-1613764059.0596786-27517-206051842621501/source dest=/etc/slurm/slurmdbd.conf owner=slurm group=slurm mode=256 follow=False _original_basename=slurmdbd.conf.j2 checksum=153cbb7266311ee247d8e07a342de8f5c1665b73 backup=False force=True content=NOT_LOGGING_PARAMETER validate=None directory_mode=None remote_src=None local_follow=None seuser=None serole=None selevel=None setype=None attributes=None regexp=None delimiter=None unsafe_writes=None
Feb 19 19:47:39 ip-10-0-70-26 platform-python[27549]: ansible-stat Invoked with path=/etc/slurm/cgroup.conf follow=False get_checksum=True checksum_algorithm=sha1 get_md5=False get_mime=True get_attributes=True
Feb 19 19:47:39 ip-10-0-70-26 platform-python[27554]: ansible-copy Invoked with src=/root/.ansible/tmp/ansible-tmp-1613764059.565857-27539-164394628155763/source dest=/etc/slurm/cgroup.conf owner=slurm group=slurm mode=256 follow=False _original_basename=cgroup.conf.j2 checksum=d8c0923ce4d0c61ce36025522d610963e987e556 backup=False force=True content=NOT_LOGGING_PARAMETER validate=None directory_mode=None remote_src=None local_follow=None seuser=None serole=None selevel=None setype=None attributes=None regexp=None delimiter=None unsafe_writes=None
Feb 19 19:47:40 ip-10-0-70-26 platform-python[27563]: ansible-file Invoked with path=/var/log/slurm/ state=directory owner=slurm group=slurm mode=493 recurse=False force=False follow=True modification_time_format=%Y%m%d%H%M.%S access_time_format=%Y%m%d%H%M.%S _original_basename=None _diff_peek=None src=None modification_time=None access_time=None seuser=None serole=None selevel=None setype=None attributes=None content=NOT_LOGGING_PARAMETER backup=None remote_src=None regexp=None delimiter=None directory_mode=None unsafe_writes=None
Feb 19 19:47:40 ip-10-0-70-26 platform-python[27568]: ansible-file Invoked with path=/var/spool/slurm/ state=directory owner=slurm group=slurm mode=493 recurse=False force=False follow=True modification_time_format=%Y%m%d%H%M.%S access_time_format=%Y%m%d%H%M.%S _original_basename=None _diff_peek=None src=None modification_time=None access_time=None seuser=None serole=None selevel=None setype=None attributes=None content=NOT_LOGGING_PARAMETER backup=None remote_src=None regexp=None delimiter=None directory_mode=None unsafe_writes=None

The /var/spool/slurm/ directory was created, but somehow it gets stuck after that?

@milliams Any idea how I can check whether it actually completed the firewalld configuration that it seems to be stuck on?
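
One guess at how to check (just a sketch, assuming the task only opens ports via firewalld, as its name suggests):

sudo firewall-cmd --state        # is firewalld running and responding?
sudo firewall-cmd --list-ports   # which ports have actually been opened
sudo firewall-cmd --list-all     # full view of the active zone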
