Commit
Added training plan logic into automate script (#510)
Training plan logic added to the automate script.
1 parent f1eaede, commit a306252
Showing 3 changed files with 234 additions and 20 deletions.
97 changes: 91 additions & 6 deletions
1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,94 @@ | ||
# Automate SageMaker HyperPod Cluster Creation | ||
A bash script that automates the manual cluster creation process for SageMaker HyperPod SLURM | ||
# SageMaker HyperPod Cluster Automation Script | ||
|
||
This automates the steps from the [SageMaker HyperPod SLURM Workshop](https://catalog.workshops.aws/sagemaker-hyperpod/en-US) | ||
This project provides a script to automate the creation and setup of a SageMaker HyperPod cluster with SLURM integration. | ||
|
||
## 🚀 Installation and Usage | ||
Using this script is very simple. Run ```bash automate-cluster-creation.sh``` | ||
The automation script streamlines setting up a distributed training environment on Amazon SageMaker HyperPod. | ||
It installs and configures the necessary tools, clones the required repository, sets up environment variables, and configures lifecycle scripts for the SageMaker HyperPod cluster. | ||
|
||
The script will walk you through creating the cluster configuration for your SageMaker HyperPod Slurm cluster. Please read through the instructions provided while running the script for the best experience. | ||
## Demo | ||
|
||
![SageMaker HyperPod Cluster Automation Demo](/1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/media/automate-smhp-demo.gif) | ||
|
||
This demo GIF shows the step-by-step process of creating and setting up a SageMaker HyperPod cluster with the automation script. | ||
|
||
- `automate-cluster-creation.sh`: The main script that automates the cluster creation process. | ||
- `README.md`: This file, providing information about the project. | ||
|
||
## Usage Instructions | ||
|
||
### Prerequisites | ||
|
||
- AWS CLI (version 2.17.1 or higher) | ||
- Git | ||
- Bash shell environment | ||
- AWS account with appropriate permissions | ||
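The AWS CLI version requirement above can be checked before running the script. The following is a minimal sketch, not the script's own logic: the helper names `version_ge` and `check_aws_cli` are hypothetical, and the comparison assumes purely numeric `major.minor.patch` version strings.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when version $1 >= version $2.
# Assumes both arguments are numeric dotted versions such as 2.17.1.
version_ge() {
    local IFS=.
    local -a have=($1) want=($2)   # split on dots via the local IFS
    local i
    for i in 0 1 2; do
        local h="${have[i]:-0}" w="${want[i]:-0}"
        if (( 10#$h > 10#$w )); then return 0; fi
        if (( 10#$h < 10#$w )); then return 1; fi
    done
    return 0   # all three components are equal
}

REQUIRED_AWS_CLI="2.17.1"

# Hypothetical helper: reports whether the installed AWS CLI is new enough.
check_aws_cli() {
    if ! command -v aws >/dev/null 2>&1; then
        echo "AWS CLI not found"
        return 1
    fi
    # `aws --version` prints a string that starts with "aws-cli/<version> ..."
    local installed
    installed="$(aws --version 2>&1 | cut -d/ -f2 | cut -d' ' -f1)"
    if version_ge "$installed" "$REQUIRED_AWS_CLI"; then
        echo "AWS CLI $installed meets the minimum of $REQUIRED_AWS_CLI"
    else
        echo "AWS CLI $installed is older than $REQUIRED_AWS_CLI"
        return 1
    fi
}
```

After sourcing these functions, calling `check_aws_cli` prints the result for the locally installed CLI.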
|
||
### Installation | ||
|
||
1. Clone this repository: | ||
```bash | ||
git clone https://github.com/aws-samples/awsome-distributed-training.git | ||
cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm | ||
``` | ||
|
||
2. Make the script executable: | ||
```bash | ||
chmod +x automate-cluster-creation.sh | ||
``` | ||
|
||
### Running the Script | ||
|
||
Execute the script: | ||
|
||
```bash | ||
./automate-cluster-creation.sh | ||
``` | ||
|
||
The script will guide you through the following steps: | ||
|
||
1. Check and install/update AWS CLI if necessary. | ||
2. Verify Git installation. | ||
3. Clone the "awsome-distributed-training" repository. | ||
4. Set up environment variables. | ||
5. Configure lifecycle scripts for SageMaker HyperPod. | ||
|
||
### Configuration | ||
|
||
During execution, you'll be prompted to provide the following information: | ||
|
||
- Name of the SageMaker VPC CloudFormation stack (default: sagemaker-hyperpod) | ||
- Confirmation if you deployed the optional hyperpod-observability CloudFormation stack | ||
- Instance group configuration (group name, instance type, instance count, etc.) | ||
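A prompt that falls back to a default value, such as the stack name above, might look like the sketch below. The function name `prompt_with_default` is hypothetical, and the actual prompts in `automate-cluster-creation.sh` may be worded and implemented differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: ask a question and fall back to a default
# when the user just presses Enter.
prompt_with_default() {
    local question="$1" default="$2" reply
    # -r keeps backslashes literal; -p shows the prompt on a terminal
    read -r -p "$question [$default]: " reply || true
    echo "${reply:-$default}"
}
```

For example, `prompt_with_default "Name of the SageMaker VPC CloudFormation stack" "sagemaker-hyperpod"` returns `sagemaker-hyperpod` when the reply is empty.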
|
||
### Troubleshooting | ||
|
||
- If you encounter permission issues when attaching IAM policies, the script will provide options to: | ||
   1. Run `aws configure` as an admin user within the script. | ||
   2. Exit the script to run `aws configure` manually. | ||
   3. Continue without configuring this step. | ||
|
||
- If environment variable generation fails: | ||
   1. You can choose to continue with the rest of the script (not recommended unless you know how to set the variables manually). | ||
   2. Exit the script to troubleshoot the issue. | ||
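The three choices offered after a failed IAM policy attachment could be dispatched with a simple `case` statement. This is only an illustrative sketch: the function name `handle_iam_failure` and the action strings it prints are hypothetical and are not taken from the script.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: map the user's menu choice to an action name.
# The real script's function names and prompts may differ.
handle_iam_failure() {
    local choice="$1"
    case "$choice" in
        1) echo "configure" ;;  # caller would run `aws configure` as an admin user
        2) echo "exit" ;;       # caller would exit so `aws configure` can be run manually
        3) echo "continue" ;;   # caller would skip the IAM policy step
        *) echo "invalid"; return 1 ;;
    esac
}
```

Returning an action name rather than acting directly keeps the menu logic easy to test in isolation.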
|
||
## Data Flow | ||
|
||
The automation script follows this general flow: | ||
|
||
1. Check and set up prerequisites (AWS CLI, Git) | ||
2. Clone necessary repositories | ||
3. Set up environment variables | ||
4. Configure lifecycle scripts | ||
5. Enable observability (if applicable) | ||
6. Attach IAM policies (if applicable) | ||
7. Create cluster configuration | ||
8. Create cluster | ||
|
||
``` | ||
[Prerequisites] -> [Clone Repos] -> [Set Up Env Vars] -> [Configure LCS] -> [Enable Observability] -> [Attach IAM Policies] -> [Create Cluster Configuration] -> [Create Cluster] | ||
``` | ||
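The flow above is a linear sequence of steps. One way to express that in bash is a small runner that calls each step in order and stops at the first failure. This is a sketch under assumptions: `run_steps` and the step function names in the usage note are hypothetical, and the actual script may let you continue past some failures (see Troubleshooting).

```bash
#!/usr/bin/env bash
# Hypothetical runner: execute each named step function in order
# and stop as soon as one of them fails.
run_steps() {
    local step
    for step in "$@"; do
        echo "==> $step"
        if ! "$step"; then
            echo "Step failed: $step" >&2
            return 1
        fi
    done
}
```

With placeholder functions defined for each stage, a call such as `run_steps check_prerequisites clone_repos setup_env_vars` would print one `==>` line per step as it runs.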
|
||
Important technical considerations: | ||
- Ensure you have the necessary AWS permissions before running the script. | ||
- The script modifies the `config.py` file to enable observability if selected. | ||
- IAM policy attachment requires admin permissions. |
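As an illustration of the `config.py` change mentioned above, the sketch below flips a boolean setting in place. It assumes the file contains a line of the form `enable_observability = False`; that setting name is an assumption and should be checked against the actual `config.py`. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: switch `enable_observability = False` to True.
# ASSUMPTION: the setting name and format match the real config.py.
enable_observability() {
    local config_file="$1" tmp
    tmp="$(mktemp)"
    # Write to a temp file instead of using `sed -i`, whose syntax
    # differs between GNU and BSD sed.
    sed 's/^\([[:space:]]*enable_observability[[:space:]]*=[[:space:]]*\)False/\1True/' \
        "$config_file" > "$tmp" && mv "$tmp" "$config_file"
}
```

Lines that do not match the pattern are left unchanged, so running the function twice is harmless.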
Binary file added: ...itectures/5.sagemaker-hyperpod/automate-smhp-slurm/media/automate-smhp-demo.gif (+16.5 MB)