update AMI and AWSParallelcluster doc
michaelmayer2 committed Nov 20, 2024
1 parent 9f1e60d commit 5409944
Showing 2 changed files with 198 additions and 1 deletion.
2 changes: 1 addition & 1 deletion docs/AMI.qmd
@@ -122,4 +122,4 @@ aws ec2 describe-images \
"Name=virtualization-type,Values=hvm" \
--query 'Images[*].[CreationDate,ImageId]' \
--output text | sort | tail -1 | awk '{print $2}'
```
```
197 changes: 197 additions & 0 deletions docs/AWSParallelcluster.qmd
@@ -0,0 +1,197 @@
---
title: "AWS ParallelCluster"
number-sections: true
---


# AWS ParallelCluster {#sec-aws-parallelcluster-build}

## Introduction

Launching the cluster via AWS ParallelCluster is where everything comes together: the custom AMI, the auxiliary services and the deployment scripts described below are combined into a running cluster.

## Python Virtual Env {#sec-python-virtual-env}

The virtual environment for AWS ParallelCluster can be created from the base folder of the git repository via

``` bash
python -m venv .aws-pc-venv
source .aws-pc-venv/bin/activate
pip install -r requirements.txt
deactivate
```

You may want to add the patch described in @sec-patch-for-elb to ensure full functionality of Workbench.

## Prerequisites {#sec-prerequisites}

- Python virtual environment set up and activated (cf. @sec-python-virtual-env)

- Auxiliary services up and running (cf. @sec-auxiliary-services)

- Custom AMI built (cf. @sec-custom-ami)

- S3 bucket set up for temporarily hosting cluster deployment files and scripts (see the example below)
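
If the bucket does not exist yet, it can be created with the AWS CLI. A minimal sketch, assuming a placeholder bucket name, region and script folder (adjust all three to your environment):

``` bash
# Create the S3 bucket used for staging the deployment files
# (bucket name and region are placeholders)
aws s3 mb s3://my-pcluster-deploy-files --region us-east-1

# Stage the deployment scripts in the bucket (the scripts/ path is an assumption)
aws s3 cp scripts/ s3://my-pcluster-deploy-files/scripts/ --recursive
```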

## Deployment instructions {#sec-deployment-instructions}

1. Review the cluster template in `config/cluster-config-wb.tmpl` and modify accordingly.

2. Review the `deploy.sh` script and modify accordingly, especially the following variables (an illustrative example follows this list):

    1. `CLUSTERNAME` - a human-readable name for your cluster

    2. `S3_BUCKETNAME` - the name of the S3 bucket you set up in @sec-prerequisites

    3. `SECURITYGROUP_RSW` - a security group that allows at least external access to ports 443 and 8787 (the latter only if SSL is not being used)

    4. `AMI` - the AMI created in @sec-how-to-build-a-custom-ami

    5. `SINGULARITY_SUPPORT` - if set to `true`, Workbench will be configured for Singularity integration and two `r-session-complete` containers (based on Ubuntu Jammy and CentOS 7) will be built. Note that this significantly extends the spin-up time of the cluster.
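
As an illustration, the variables in `deploy.sh` could be set as follows (all values below are placeholders, not taken from the actual repository):

``` bash
# Example values only - adjust to your environment
CLUSTERNAME="workbench-hpc"                # human-readable cluster name
S3_BUCKETNAME="my-pcluster-deploy-files"   # S3 bucket from @sec-prerequisites
SECURITYGROUP_RSW="sg-0123456789abcdef0"   # must allow inbound 443 and 8787
AMI="ami-0123456789abcdef0"                # custom AMI built earlier
SINGULARITY_SUPPORT="true"                 # also build r-session-complete containers
```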

## Default values for Cluster deployment

For the `deploy.sh` script, all relevant parameters not mentioned in step 2 of the deployment instructions (cf. @sec-deployment-instructions) are extracted from the Pulumi deployment of the auxiliary services.
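
As a sketch of what such an extraction could look like - the stack and output names below are assumptions, not the ones actually used in the repository:

``` bash
# Hypothetical example: read values exported by the auxiliary-services stack
SUBNET_ID=$(pulumi stack output subnetId --stack aux-services)
EFS_ID=$(pulumi stack output efsId --stack aux-services)
```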

The default values in the cluster template `config/cluster-config-wb.tmpl` are as follows:

- EFS storage used for the shared file systems needed by AWS ParallelCluster

- One head node with

    - Instance type `t3.xlarge`

    - 100 GB of local EBS storage

    - Script `install-pwb-config.sh` triggered when the head node is being deployed

- Compute nodes with

    - Script `config-compute.sh` triggered when a compute node starts

    - Partition `all`

        - Instance type `t3.xlarge`

        - Minimum/maximum number of instances: 1/10

    - Partition `gpu`

        - Instance type `p3.2xlarge`

        - Minimum/maximum number of instances: 0/1

- 2 login nodes with

    - Instance type `t3.xlarge`
    - ELB in front

- Shared storage for `/home` - FSx for Lustre with a capacity of 1.2 TB and deployment type `SCRATCH_2`

All of the above settings (instance types, node counts, FSx size) can be changed as needed.
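
For reference, a cluster can also be created manually from a rendered configuration file with the ParallelCluster CLI. A minimal sketch (cluster and file names are placeholders):

``` bash
# Create the cluster from the rendered configuration
pcluster create-cluster \
    --cluster-name workbench-hpc \
    --cluster-configuration config/cluster-config-wb.yaml

# Follow the deployment progress
pcluster describe-cluster --cluster-name workbench-hpc
```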

## Notes on `install-pwb-config.sh` and `install-compute.sh` {#sec-notes-on-install-pwb-config.sh}

`install-pwb-config.sh` mainly creates the Posit Workbench configuration files and configures the Workbench systemd services `rstudio-launcher` and `rstudio-server`. It is only executed on the designated head node.

- Workbench uses `/opt/parallelcluster/shared/rstudio/` as the base directory for its configuration (`PWB_BASE_DIR`). `/opt/parallelcluster/shared` is already created by AWS ParallelCluster and shared across all nodes (head, login and compute), so we make use of this functionality.

- Configuration files are deployed in `$PWB_BASE_DIR/etc/rstudio`.

- Shared storage is configured in `$PWB_BASE_DIR/shared`.

- The R versions file is configured in `$PWB_BASE_DIR/shared/r-versions`.

- In order to distinguish the head node from the login nodes, an empty file `/etc/head-node` is created. This file is used by the cron job mentioned in @sec-custom-ami to differentiate the login nodes from the head node.
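
A simplified sketch of the resulting directory layout setup - not a verbatim copy of the actual script:

``` bash
# Base directory for the Workbench configuration, shared across all nodes
PWB_BASE_DIR="/opt/parallelcluster/shared/rstudio"

# Layout for configuration files, shared storage and the r-versions file
mkdir -p "${PWB_BASE_DIR}/etc/rstudio" "${PWB_BASE_DIR}/shared"
touch "${PWB_BASE_DIR}/shared/r-versions"

# Mark this node as the head node for the cron job described below
touch /etc/head-node
```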

The `install-compute.sh` script detects the presence of a GPU and, if one is found, automatically updates the NVIDIA/CUDA driver and installs the cuDNN library for GPU computing. This is more of a nice-to-have, but it is rather useful for distributed TensorFlow and similar workloads.
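
A rough sketch of this kind of check - the actual driver, CUDA and cuDNN installation steps are omitted, as they differ per OS and driver version:

``` bash
# Only act if an NVIDIA GPU is present on this compute node
if lspci | grep -qi nvidia; then
    echo "GPU detected - updating NVIDIA/CUDA driver and installing cuDNN"
    # ... driver / CUDA / cuDNN installation steps go here ...
fi
```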

![Architecture diagram](Benchmark.drawio.png){#architecture}

## Customisations on top of AWS ParallelCluster

### Elastic Load Balancer

AWS ParallelCluster sets up an ELB for the login nodes and ensures that the desired number of login nodes is available at any given time. By default, the ELB listens on port 22 (SSH). In order to change that, the Python scripts need to be patched slightly (patch supplied in @sec-patch-for-elb).

This change is simple but effectively disables the ability to SSH into the ELB. Typically, however, Workbench users do not need SSH access to the login nodes - if needed, they can open a terminal within the RStudio IDE, for example.

An alternative would be to add a second ELB for Workbench, but this would imply a significantly larger patch to AWS ParallelCluster.

### The "thing" with the Login Nodes

AWS ParallelCluster introduced the ability to define separate login nodes in [Version 3.7.0](https://github.com/aws/aws-parallelcluster/releases/tag/v3.7.0). This is great and replaces a rather [complicated workaround](https://github.com/aws/aws-parallelcluster/wiki/ParallelCluster:-Launching-a-Login-Node) that was in place until then. Unfortunately, the team did not add all of the same features, such as `OnNodeConfigured`, to the new [Login Nodes](https://docs.aws.amazon.com/parallelcluster/latest/ug/LoginNodes-v3.html). We have raised a [GitHub issue](https://github.com/aws/aws-parallelcluster/issues/5723), which was acknowledged; the missing feature will be implemented in an upcoming release.

As a consequence, we have implemented a workaround with a cron job that runs every minute on all ParallelCluster-managed nodes (login, head and compute). A node is detected as a login node if there is an NFS mount whose name contains `login_node` and if the file `/etc/head-node` does not exist (the latter would indicate a head node). See @sec-notes-on-install-pwb-config.sh for additional information.
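
A minimal sketch of this detection logic - what exactly happens on a detected login node (here: starting the Workbench services) is an assumption and may differ from the actual script:

``` bash
#!/bin/bash
# Run from cron every minute on every ParallelCluster-managed node.
# A login node has an NFS mount whose name contains "login_node"
# and does not carry the /etc/head-node marker file.
if grep -qs 'login_node' /proc/mounts && [ ! -f /etc/head-node ]; then
    # Login-node-specific configuration (assumed here to be starting Workbench)
    systemctl start rstudio-server rstudio-launcher
fi
```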

Until the [GitHub issue](https://github.com/aws/aws-parallelcluster/issues/5723) is fixed, we will have to live with this workaround.

# Summary and Conclusions

This document describes one possibility of integrating Workbench and AWS ParallelCluster that allows for partial high availability (HA). The setup can tolerate and recover from login node failures, and as a consequence the Workbench part is HA.

The main ingredient for this setup is a custom AMI with all the needed software (Workbench, R, Python, ...) baked in, which can be used for all three node types (login, head and compute).

In order to achieve this, some additional logic has to be implemented and some workarounds for missing features in AWS ParallelCluster have to be used.

The remaining issue, however, is the single head node, which is a single point of failure (if the head node crashes, SLURM stops working).

### How to reach "full" HA

AWS ParallelCluster makes a clear distinction between head and login nodes. This is more than justified given that the head node not only runs `slurmctld` but can also act as an NFS server exporting file systems such as `/home`, `/opt/slurm`, ... This makes the head node a single point of failure from the NFS perspective alone.

With the release of [AWS ParallelCluster 3.8.0](https://github.com/aws/aws-parallelcluster/blob/develop/CHANGELOG.md#380) (currently available as a beta version), all the NFS file systems can be hosted on external EFS. This removes the single point of failure for the NFS server. There is a bug in the beta version [where all but one file system can be hosted on EFS](https://github.com/aws/aws-parallelcluster/issues/5812), but this will be fixed in the official release of 3.8.0.

Once this is in place, the boundaries between the login nodes and the head node will become much less clear. By adding additional logic, one can automatically start additional `slurmctld` processes on the login nodes and register those hosts in the Slurm configuration. If the head node then fails, the `slurmctld` on one of the login nodes will take over. While adding additional `slurmctld` daemons is fairly straightforward, there is also a need to regularly check whether all the defined `slurmctld` hosts are still up and running; if not, they need to be removed from the Slurm configuration.
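
In Slurm, this boils down to listing additional `SlurmctldHost` entries, where the order defines the failover priority, and restarting the Slurm daemons. A sketch with hypothetical hostnames (the primary head node is assumed to be listed first already):

``` bash
# Register the login nodes as backup slurmctld hosts (hostnames are placeholders)
cat >> /opt/slurm/etc/slurm.conf <<'EOF'
SlurmctldHost=login-node-1
SlurmctldHost=login-node-2
EOF

# The controller (and the slurmd daemons) need to pick up the new host list
systemctl restart slurmctld
```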

The complexity of establishing the above is fairly small, but it is another customisation we have to make and maintain. As long as this is only a Posit-internal solution, we should be OK.

A drawback of full HA as described above, however, is that the ParallelCluster API will very likely become unusable if the head node is no longer available. Things like updating the configuration and settings of the running cluster may no longer work. Whether this is needed in a production cluster is another matter of debate.

# Appendix

## Patch for ELB to listen on port 8787 instead of 22 {#sec-patch-for-elb}

``` diff
diff -u --recursive pcluster/templates/cluster_stack.py pcluster.new/templates/cluster_stack.py
--- pcluster/templates/cluster_stack.py 2023-11-22 12:25:53
+++ pcluster.new/templates/cluster_stack.py 2023-11-22 15:11:48
@@ -871,10 +871,10 @@
def _get_source_ingress_rule(self, setting):
if setting.startswith("pl"):
return ec2.CfnSecurityGroup.IngressProperty(
- ip_protocol="tcp", from_port=22, to_port=22, source_prefix_list_id=setting
+ ip_protocol="tcp", from_port=8787, to_port=8787, source_prefix_list_id=setting
)
else:
- return ec2.CfnSecurityGroup.IngressProperty(ip_protocol="tcp", from_port=22, to_port=22, cidr_ip=setting)
+ return ec2.CfnSecurityGroup.IngressProperty(ip_protocol="tcp", from_port=8787, to_port=8787, cidr_ip=setting)
def _add_login_nodes_security_group(self):
login_nodes_security_group_ingress = [
diff -u --recursive pcluster/templates/login_nodes_stack.py pcluster.new/templates/login_nodes_stack.py
--- pcluster/templates/login_nodes_stack.py 2023-11-22 12:25:53
+++ pcluster.new/templates/login_nodes_stack.py 2023-11-22 15:11:19
@@ -273,10 +273,10 @@
self,
f"{self._pool.name}TargetGroup",
health_check=elbv2.HealthCheck(
- port="22",
+ port="8787",
protocol=elbv2.Protocol.TCP,
),
- port=22,
+ port=8787,
protocol=elbv2.Protocol.TCP,
target_type=elbv2.TargetType.INSTANCE,
vpc=self._vpc,
@@ -299,7 +299,7 @@
),
)
- listener = login_nodes_load_balancer.add_listener(f"LoginNodesListener{self._pool.name}", port=22)
+ listener = login_nodes_load_balancer.add_listener(f"LoginNodesListener{self._pool.name}", port=8787)
listener.add_target_groups(f"LoginNodesListenerTargets{self._pool.name}", target_group)
return login_nodes_load_balancer
```
