Add database deployment for Slurm accounting. (#452)
* Add accounting database deployment. Close #450
* Update 1.architectures/8.accounting-database/README.md (Co-authored-by: Sean Smith <[email protected]>)
* Fix numbering in DB accounting
* Fix database config steps

Co-authored-by: Sean Smith <[email protected]>
1 parent
32fb5ae
commit 86ce468
Showing 2 changed files with 352 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,128 @@ | ||
# Deploy an accounting database for Slurm | ||
|
||
This sample sets up an Amazon RDS database for HPC and machine learning clusters. | ||
You can use it for Slurm accounting and to generate usage reports for your cluster. | ||
For more information, see the Slurm documentation on [accounting](https://slurm.schedmd.com/accounting.html). | ||
|
||
You need at least two private subnets in different Availability Zones to deploy the database. | ||
|
||
## Deploy | ||
|
||
Deploy the accounting database using the 1-click deploy: | ||
|
||
[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https%3A%2F%2Fawsome-distributed-training.s3.amazonaws.com%2Ftemplates%2Fcf_database-accounting.yaml&stackName=slurm-accounting-database) | ||
|
||
**Note:** Alternatively, you can deploy with the AWS CLI and CloudFormation: | ||
```bash | ||
aws cloudformation deploy --stack-name slurm-accounting-database \ | ||
--template-file cf_database-accounting.yaml \ | ||
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \ | ||
--parameter-overrides VpcId=XXX SubnetIds=XXX,XXX | ||
``` | ||
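To find candidate values for `SubnetIds`, it can help to list the subnets of the VPC together with their Availability Zones. The following is a minimal sketch, not part of the sample itself: `list_subnets` is a hypothetical helper name, and a `False` in the last column (`MapPublicIpOnLaunch`) only hints that a subnet is private, so check the route tables to be sure.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print subnet ID, Availability Zone, and the
# MapPublicIpOnLaunch flag for every subnet of the given VPC.
list_subnets() {
  local vpc_id="$1"
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=${vpc_id}" \
    --query 'Subnets[].[SubnetId,AvailabilityZone,MapPublicIpOnLaunch]' \
    --output text
}
```

Calling `list_subnets <your-vpc-id>` prints one tab-separated row per subnet. Pick two subnets whose second column differs.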
|
||
## Get database parameters | ||
In this section, you will retrieve the database parameters that Slurm uses to connect to the accounting database. | ||
|
||
### Get the Database URI | ||
|
||
Retrieve the `DatabaseHost` stack output, which is the endpoint address Slurm uses to connect to the database. | ||
```bash | ||
DATABASE_URI=$(aws cloudformation describe-stacks \ | ||
--stack-name slurm-accounting-database \ | ||
--query 'Stacks[0].Outputs[?OutputKey==`DatabaseHost`].OutputValue' \ | ||
--output text) | ||
``` | ||
The value is the hostname of the database endpoint, not a web URL. Slurm connects to it on port 3306. | ||
|
||
### Get the Database Admin User | ||
The database admin user is `clusteradmin` by default, unless you changed it when creating the stack. | ||
|
||
Get the database admin user name: | ||
```bash | ||
DATABASE_ADMIN=$(aws cloudformation describe-stacks \ | ||
--stack-name slurm-accounting-database \ | ||
--query 'Stacks[0].Outputs[?OutputKey==`DatabaseAdminUser`].OutputValue' \ | ||
--output text) | ||
``` | ||
|
||
### Get the Database password | ||
The password to access the database is generated randomly and stored in AWS Secrets Manager in a secret named `AccountingClusterAdminSecre-XXX`. The CloudFormation stack exposes the ARN of this secret as its `DatabaseSecretArn` output. | ||
|
||
Get the secret ARN: | ||
```bash | ||
DATABASE_SECRET_ARN=$(aws cloudformation describe-stacks \ | ||
--stack-name slurm-accounting-database \ | ||
--query 'Stacks[0].Outputs[?OutputKey==`DatabaseSecretArn`].OutputValue' \ | ||
--output text) | ||
``` | ||
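The three lookups above follow the same pattern, so they can be wrapped in a small helper. This is a sketch under assumptions: `stack_output` and `require_vars` are hypothetical names that are not part of the sample, and the stack name defaults to `slurm-accounting-database` as used above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print one output value of the CloudFormation stack.
stack_output() {
  local key="$1" stack="${2:-slurm-accounting-database}"
  aws cloudformation describe-stacks \
    --stack-name "$stack" \
    --query "Stacks[0].Outputs[?OutputKey=='${key}'].OutputValue" \
    --output text
}

# Hypothetical helper: fail if any of the named shell variables is unset or empty.
require_vars() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "Missing value for ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `DATABASE_URI=$(stack_output DatabaseHost)` followed by `require_vars DATABASE_URI DATABASE_ADMIN DATABASE_SECRET_ARN` stops early if a lookup came back empty, before an incomplete value is written into a Slurm configuration file.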
|
||
## Configure AWS ParallelCluster | ||
Starting with version 3.3.0, AWS ParallelCluster supports Slurm accounting with the cluster configuration parameter `SlurmSettings / Database`. | ||
|
||
To use the database created previously for accounting, add the following in the `SlurmSettings` section of your cluster configuration file: | ||
|
||
```yaml | ||
Database: | ||
  Uri: ${DATABASE_URI}:3306 | ||
  UserName: ${DATABASE_ADMIN} | ||
  PasswordSecretArn: ${DATABASE_SECRET_ARN} | ||
CustomSlurmSettings: | ||
  # Enable accounting for GPU resources. | ||
  - AccountingStorageTRES: gres/gpu | ||
``` | ||
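The `${...}` placeholders in the fragment above stand for the values retrieved in the previous section. One way to fill them in, assuming `DATABASE_URI`, `DATABASE_ADMIN` and `DATABASE_SECRET_ARN` are set in the current shell, is to print the fragment from a heredoc. `render_database_settings` is a hypothetical helper name used only for this sketch.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the Database fragment with the shell variables expanded.
# The output still has to be placed under SlurmSettings in the cluster configuration.
render_database_settings() {
  cat << EOF
Database:
  Uri: ${DATABASE_URI}:3306
  UserName: ${DATABASE_ADMIN}
  PasswordSecretArn: ${DATABASE_SECRET_ARN}
EOF
}
```

The printed lines need to be indented to match the nesting of `SlurmSettings` in your configuration file before you paste them in.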
## Amazon SageMaker HyperPod Orchestrated by Slurm | ||
There are two steps to set up Slurm with the accounting database: | ||
1. Add database configuration file | ||
1. Configure Slurm accounting | ||
### Add database configuration file | ||
You need to execute the following command on the controller node to configure the database connectivity for Slurm. | ||
```bash | ||
cat > /opt/slurm/etc/slurmdbd.conf << EOF | ||
AuthType=auth/munge | ||
DbdHost=$(hostname) # Hostname of the Slurm controller node. | ||
DbdPort=6819 | ||
SlurmUser=slurm | ||
LogFile=/var/log/slurmdbd.log | ||
StorageType=accounting_storage/mysql | ||
StorageUser=${DATABASE_ADMIN} | ||
StoragePass=$(aws secretsmanager get-secret-value --secret-id ${DATABASE_SECRET_ARN} --query SecretString --output text) | ||
StorageHost=${DATABASE_URI} | ||
StoragePort=3306 | ||
EOF | ||
``` | ||
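The file written above contains the database password in plain text (`StoragePass`), and the Slurm documentation states that `slurmdbd.conf` should be readable only by the user that runs `slurmdbd`. The sketch below tightens the permissions. `restrict_conf` is a hypothetical helper name, and the owner argument is optional because changing ownership usually requires root.

```bash
#!/usr/bin/env bash

# Hypothetical helper: make a config file readable and writable by its owner only,
# and optionally hand it over to the given owner (for example the SlurmUser).
restrict_conf() {
  local conf="$1" owner="${2:-}"
  chmod 600 "$conf" || return 1
  # Only attempt the ownership change when an owner was passed in.
  if [ -n "$owner" ]; then
    chown "$owner" "$conf" || return 1
  fi
}
```

Run as root on the controller node, `restrict_conf /opt/slurm/etc/slurmdbd.conf slurm` would apply this to the file created above, assuming `slurm` is the `SlurmUser` as configured there.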
|
||
### Configure Slurm accounting | ||
|
||
```bash | ||
cat >> /opt/slurm/etc/slurm.conf << EOF | ||
# ACCOUNTING | ||
JobAcctGatherType=jobacct_gather/linux | ||
JobAcctGatherFrequency=60 | ||
AccountingStorageType=accounting_storage/slurmdbd | ||
AccountingStorageHost=$(hostname) | ||
AccountingStorageUser=${DATABASE_ADMIN} | ||
AccountingStoragePort=6819 | ||
AccountingStorageTRES=gres/gpu | ||
EOF | ||
``` | ||
|
||
Restart `slurmctld` to pick up the configuration change. | ||
```bash | ||
sudo systemctl restart slurmctld | ||
sudo scontrol reconfigure | ||
``` | ||
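After the restart, a quick query can confirm that accounting is working. The following is a minimal sketch rather than part of the sample: `check_accounting` is a hypothetical helper name, and it assumes the Slurm client tools `sacctmgr` and `sacct` are in the `PATH` of the controller node.

```bash
#!/usr/bin/env bash

# Hypothetical helper: run basic accounting queries if the Slurm tools are installed.
check_accounting() {
  if ! command -v sacctmgr >/dev/null 2>&1; then
    echo "sacctmgr not found in PATH" >&2
    return 1
  fi
  # List the clusters registered in the accounting database.
  sacctmgr --noheader --parsable2 show cluster format=cluster || return 1
  # Show jobs of all users with their allocated trackable resources (including GPUs).
  sacct --allusers --format=JobID,JobName,AllocTRES,State
}
```

If the cluster is not listed or the queries fail with a connection error, a likely cause is that the `slurmdbd` daemon is not running or cannot reach the database. `/var/log/slurmdbd.log`, as configured above, is the first place to look.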
|
||
For more information on how to use Slurm accounting, see the examples on the [HPC blog](https://aws.amazon.com/blogs/compute/enabling-job-accounting-for-hpc-with-aws-parallelcluster-and-amazon-rds/). | ||
|
||
## Delete the database | ||
Once you have deleted your cluster and no longer need the Slurm accounting data, you can delete the database. | ||
Use the command below to delete the AWS CloudFormation stack of the database. | ||
**ALL** accounting data will be deleted. | ||
|
||
```bash | ||
aws cloudformation delete-stack --stack-name slurm-accounting-database | ||
``` |
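`delete-stack` returns as soon as the deletion has been requested. To block until the stack is actually gone, the AWS CLI offers a waiter. In the sketch below, `delete_accounting_stack` is a hypothetical helper name, and the stack name defaults to the one used in this README.

```bash
#!/usr/bin/env bash

# Hypothetical helper: request the stack deletion, then wait for it to finish.
delete_accounting_stack() {
  local stack="${1:-slurm-accounting-database}"
  aws cloudformation delete-stack --stack-name "$stack" || return 1
  # Block until CloudFormation reports the stack as deleted.
  aws cloudformation wait stack-delete-complete --stack-name "$stack"
}
```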
224 changes: 224 additions & 0 deletions
1.architectures/8.accounting-database/cf_database-accounting.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,224 @@ | ||
AWSTemplateFormatVersion: 2010-09-09 | ||
Description: >- | ||
  This template creates a database for Slurm Accounting. | ||
Metadata: | ||
  AWS::CloudFormation::Interface: | ||
    ParameterGroups: | ||
      - Label: | ||
          default: "Database Cluster Configuration" | ||
        Parameters: | ||
          - ClusterName | ||
          - ClusterAdmin | ||
          - AdminPasswordSecretString | ||
          - MinCapacity | ||
          - MaxCapacity | ||
    ParameterLabels: | ||
      ClusterName: | ||
        default: "The name of the database cluster" | ||
      ClusterAdmin: | ||
        default: "The database administrator user name." | ||
      AdminPasswordSecretString: | ||
        default: "The administrator password." | ||
      MinCapacity: | ||
        default: "The minimum scaling capacity of the database cluster." | ||
      MaxCapacity: | ||
        default: "The maximum scaling capacity of the database cluster." | ||
      VpcId: | ||
        default: "VPC ID in which the cluster will be created." | ||
      SubnetIds: | ||
        default: "Subnet IDs where database will have connections. Recommend to host in private subnets." | ||
|
||
############### | ||
## Parameters | ||
Parameters: | ||
  ClusterName: | ||
    Description: Database Cluster Name | ||
    Type: String | ||
    Default: "slurm-accounting-cluster" | ||
    MinLength: 1 | ||
    MaxLength: 63 | ||
    AllowedPattern: ^[a-z][-a-z0-9]{0,62}$ | ||
    ConstraintDescription: >- | ||
      Cluster name must be between 1 and 63 characters, start with a lower case character, and be followed by a mix of | ||
      lower case characters, digits, and - (hyphens). | ||
  ClusterAdmin: | ||
    Description: Administrator user name. | ||
    Type: String | ||
    Default: clusteradmin | ||
    MinLength: 3 | ||
    MaxLength: 64 | ||
  MinCapacity: | ||
    Description: Must be less than the maximum capacity. | ||
    Type: Number | ||
    Default: 1 | ||
    MinValue: .5 | ||
    MaxValue: 127.5 | ||
  MaxCapacity: | ||
    Description: Must be greater than or equal to the minimum capacity. | ||
    Type: Number | ||
    Default: 4 | ||
    MinValue: 1 | ||
    MaxValue: 128 | ||
  VpcId: | ||
    Description: VPC ID in which the cluster is deployed. | ||
    Type: AWS::EC2::VPC::Id | ||
  SubnetIds: | ||
    Description: Subnets in which database will be reachable. | ||
    Type: List<AWS::EC2::Subnet::Id> | ||
|
||
############################ | ||
## Database Resources | ||
Transform: AWS::Serverless-2016-10-31 | ||
Resources: | ||
########### | ||
# Generate Password and Store in SecretsManager | ||
  AccountingClusterAdminSecret: | ||
    Type: AWS::SecretsManager::Secret | ||
    Properties: | ||
      Description: 'Serverless Database Cluster Administrator Password' | ||
      GenerateSecretString: | ||
        PasswordLength: 16 | ||
        ExcludeCharacters: '#"@/\' | ||
      Tags: | ||
        - Key: 'parallelcluster:usecase' | ||
          Value: 'slurm accounting' | ||
|
||
########### | ||
# Database network | ||
  AccountingClusterParameterGroup: | ||
    Type: 'AWS::RDS::DBClusterParameterGroup' | ||
    Properties: | ||
      Description: Cluster parameter group for aurora-mysql | ||
      Family: aurora-mysql8.0 | ||
      Parameters: | ||
        require_secure_transport: 'ON' | ||
        innodb_lock_wait_timeout: '900' | ||
      Tags: | ||
        - Key: 'parallelcluster:usecase' | ||
          Value: 'slurm accounting' | ||
  AccountingClusterSubnetGroup: | ||
    Type: 'AWS::RDS::DBSubnetGroup' | ||
    Properties: | ||
      DBSubnetGroupDescription: !Sub 'Subnets for AccountingCluster-${AWS::Region} database' | ||
      SubnetIds: !Ref SubnetIds | ||
      Tags: | ||
        - Key: 'parallelcluster:usecase' | ||
          Value: 'slurm accounting' | ||
  AccountingClusterSecurityGroup: | ||
    Type: 'AWS::EC2::SecurityGroup' | ||
    Properties: | ||
      GroupDescription: RDS security group | ||
      SecurityGroupEgress: | ||
        - CidrIp: 0.0.0.0/0 | ||
          Description: Allow all outbound traffic by default | ||
          IpProtocol: '-1' | ||
      Tags: | ||
        - Key: 'parallelcluster:usecase' | ||
          Value: 'slurm accounting' | ||
      VpcId: !Ref VpcId | ||
  AccountingClusterSecurityGroupInboundRule: | ||
    Type: 'AWS::EC2::SecurityGroupIngress' | ||
    Properties: | ||
      IpProtocol: tcp | ||
      Description: Allow incoming connections from client security group | ||
      FromPort: !GetAtt | ||
        - AccountingCluster | ||
        - Endpoint.Port | ||
      GroupId: !GetAtt | ||
        - AccountingClusterSecurityGroup | ||
        - GroupId | ||
      SourceSecurityGroupId: !GetAtt | ||
        - AccountingClusterClientSecurityGroup | ||
        - GroupId | ||
      ToPort: !GetAtt | ||
        - AccountingCluster | ||
        - Endpoint.Port | ||
|
||
########### | ||
# Database Cluster | ||
  AccountingCluster: | ||
    Type: 'AWS::RDS::DBCluster' | ||
    Properties: | ||
      DBClusterIdentifier: !Ref ClusterName | ||
      Engine: "aurora-mysql" | ||
      EngineVersion: "8.0.mysql_aurora.3.07.1" | ||
      CopyTagsToSnapshot: true | ||
      DBClusterParameterGroupName: !Ref AccountingClusterParameterGroup | ||
      DBSubnetGroupName: !Ref AccountingClusterSubnetGroup | ||
      EnableHttpEndpoint: false | ||
      MasterUsername: !Ref ClusterAdmin | ||
      MasterUserPassword: !Sub "{{resolve:secretsmanager:${AccountingClusterAdminSecret}}}" | ||
      ServerlessV2ScalingConfiguration: | ||
        MaxCapacity: !Ref MaxCapacity | ||
        MinCapacity: !Ref MinCapacity | ||
      StorageEncrypted: true | ||
      Tags: | ||
        - Key: 'parallelcluster:usecase' | ||
          Value: 'slurm accounting' | ||
      VpcSecurityGroupIds: | ||
        - !GetAtt | ||
          - AccountingClusterSecurityGroup | ||
          - GroupId | ||
    UpdateReplacePolicy: Delete | ||
    DeletionPolicy: Delete | ||
  AccountingClusterInstance1: | ||
    Type: 'AWS::RDS::DBInstance' | ||
    Properties: | ||
      DBInstanceClass: 'db.serverless' | ||
      DBClusterIdentifier: !Ref AccountingCluster | ||
      DBInstanceIdentifier: !Sub '${ClusterName}-instance-1' | ||
      Engine: "aurora-mysql" | ||
      PubliclyAccessible: false | ||
    UpdateReplacePolicy: Delete | ||
    DeletionPolicy: Delete | ||
  AccountingClusterClientSecurityGroup: | ||
    Type: 'AWS::EC2::SecurityGroup' | ||
    Properties: | ||
      GroupDescription: Security Group to allow connection to Serverless DB Cluster | ||
      Tags: | ||
        - Key: 'parallel-cluster:usecase' | ||
          Value: 'slurm accounting' | ||
      VpcId: !Ref VpcId | ||
  AccountingClusterClientSecurityGroupOutboundRule: | ||
    Type: 'AWS::EC2::SecurityGroupEgress' | ||
    Properties: | ||
      GroupId: !GetAtt | ||
        - AccountingClusterClientSecurityGroup | ||
        - GroupId | ||
      IpProtocol: tcp | ||
      Description: Allow incoming connections from PCluster | ||
      DestinationSecurityGroupId: !GetAtt | ||
        - AccountingClusterSecurityGroup | ||
        - GroupId | ||
      FromPort: !GetAtt | ||
        - AccountingCluster | ||
        - Endpoint.Port | ||
      ToPort: !GetAtt | ||
        - AccountingCluster | ||
        - Endpoint.Port | ||
|
||
###################### | ||
# Outputs | ||
Outputs: | ||
  ClusterName: | ||
    Value: !Ref ClusterName | ||
  DatabaseHost: | ||
    Value: !GetAtt | ||
      - AccountingCluster | ||
      - Endpoint.Address | ||
  DatabasePort: | ||
    Value: !GetAtt | ||
      - AccountingCluster | ||
      - Endpoint.Port | ||
  DatabaseAdminUser: | ||
    Value: !Ref ClusterAdmin | ||
  DatabaseSecretArn: | ||
    Value: !Ref AccountingClusterAdminSecret | ||
  DatabaseClusterSecurityGroup: | ||
    Value: !GetAtt | ||
      - AccountingClusterSecurityGroup | ||
      - GroupId | ||
  DatabaseClientSecurityGroup: | ||
    Value: !GetAtt | ||
      - AccountingClusterClientSecurityGroup | ||
      - GroupId |