Add database deployment for Slurm accounting. (#452)
* Add accounting database deployment

Close #450

* Update 1.architectures/8.accounting-database/README.md

Co-authored-by: Sean Smith <[email protected]>

* Fix numbering in DB accounting

* Fix database config steps

---------

Co-authored-by: Sean Smith <[email protected]>
mhuguesaws and sean-smith authored Oct 14, 2024
1 parent 32fb5ae commit 86ce468
Showing 2 changed files with 352 additions and 0 deletions.
128 changes: 128 additions & 0 deletions 1.architectures/8.accounting-database/README.md
@@ -0,0 +1,128 @@
# Deploy an accounting database for Slurm

This sample sets up an Amazon RDS database for HPC and machine learning clusters.
You can use it for Slurm accounting and to generate reports on your cluster usage.
For more information, see the Slurm documentation on [accounting](https://slurm.schedmd.com/accounting.html).

You will need at least two private subnets in different Availability Zones to deploy the database.
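
Before deploying, it can help to confirm that the subnets you plan to pass as `SubnetIds` really span more than one Availability Zone. The following is a minimal sketch, not part of the sample itself: the function name `count_subnet_azs` and the subnet IDs in the usage comment are placeholders, and it assumes the AWS CLI is configured for the target region.

```bash
# Hypothetical helper: print how many distinct Availability Zones a
# comma-separated list of subnet IDs covers.
count_subnet_azs() {
  # $1: comma-separated subnet IDs, in the same form used for SubnetIds.
  # The command substitution is left unquoted on purpose so that each
  # subnet ID becomes a separate argument.
  aws ec2 describe-subnets \
    --subnet-ids $(printf '%s' "$1" | tr ',' ' ') \
    --query 'Subnets[].AvailabilityZone' \
    --output text | tr '\t' '\n' | sort -u | grep -c .
}

# Usage (placeholder subnet IDs); a result below 2 means the subnet list
# does not meet the requirement stated above:
#   count_subnet_azs subnet-aaa,subnet-bbb
```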

## Deploy

Deploy the accounting database using the 1-click deploy:

[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https%3A%2F%2Fawsome-distributed-training.s3.amazonaws.com%2Ftemplates%2Fcf_database-accounting.yaml&stackName=slurm-accounting-database)

**Note**: Alternatively, you can deploy using the AWS CLI and CloudFormation:
```bash
aws cloudformation deploy --stack-name slurm-accounting-database \
--template-file cf_database-accounting.yaml \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
--parameter-overrides VpcId=XXX SubnetIds=XXX,XXX
```

## Get database parameters
In this section, you will retrieve the database parameters that Slurm uses to connect to the accounting database.

### Get the Database URI

Retrieve the `DatabaseHost` stack output, which is the endpoint Slurm uses to connect to the database.
```bash
DATABASE_URI=$(aws cloudformation describe-stacks \
--stack-name slurm-accounting-database \
--query 'Stacks[0].Outputs[?OutputKey==`DatabaseHost`].OutputValue' \
--output text)
```
`DATABASE_URI` now holds the database endpoint hostname used in the configuration steps below.

### Get the Database Admin User
The database admin user is `clusteradmin` by default, unless you changed it when creating the stack.

Get the database admin user name:
```bash
DATABASE_ADMIN=$(aws cloudformation describe-stacks \
--stack-name slurm-accounting-database \
--query 'Stacks[0].Outputs[?OutputKey==`DatabaseAdminUser`].OutputValue' \
--output text)
```

### Get the Database password
The password to access the database was generated randomly and stored in AWS Secrets Manager under a secret named `AccountingClusterAdminSecre-XXX`. Its ARN is available as the `DatabaseSecretArn` output of the CloudFormation stack.

Get the secret ARN:
```bash
DATABASE_SECRET_ARN=$(aws cloudformation describe-stacks \
--stack-name slurm-accounting-database \
--query 'Stacks[0].Outputs[?OutputKey==`DatabaseSecretArn`].OutputValue' \
--output text)
```
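
The three lookups above share the same `describe-stacks` query and differ only in the output key. As a sketch, they could be wrapped in a small helper that also fails early when an output comes back empty. The function names `stack_output` and `require_stack_output` and the `STACK_NAME` variable are assumptions introduced for illustration.

```bash
# Hypothetical helper: print the value of one CloudFormation stack output.
# STACK_NAME is an assumed variable; it defaults to the stack name used above.
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "${STACK_NAME:-slurm-accounting-database}" \
    --query "Stacks[0].Outputs[?OutputKey==\`$1\`].OutputValue" \
    --output text
}

# Hypothetical helper: same as above, but return non-zero when the output
# is missing so that later steps do not run with empty values.
require_stack_output() {
  local value
  value=$(stack_output "$1") || return 1
  if [ -z "$value" ] || [ "$value" = "None" ]; then
    echo "Stack output '$1' not found" >&2
    return 1
  fi
  printf '%s\n' "$value"
}

# Usage:
#   DATABASE_URI=$(require_stack_output DatabaseHost)
#   DATABASE_ADMIN=$(require_stack_output DatabaseAdminUser)
#   DATABASE_SECRET_ARN=$(require_stack_output DatabaseSecretArn)
```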

## Configure AWS ParallelCluster
Starting with version 3.3.0, AWS ParallelCluster supports Slurm accounting with the cluster configuration parameter `SlurmSettings / Database`.

To use the database created previously for accounting, add the following in the `SlurmSettings` section of your cluster configuration file:

```yaml
Database:
  Uri: ${DATABASE_URI}:3306
  UserName: ${DATABASE_ADMIN}
  PasswordSecretArn: ${DATABASE_SECRET_ARN}
CustomSlurmSettings:
  # Enable accounting for GPU resources.
  - AccountingStorageTRES: gres/gpu
```
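
The `${...}` placeholders in the snippet refer to the shell variables retrieved earlier. Assuming the cluster configuration file is read literally, without shell-style variable expansion, the values have to be substituted before the file is used. A minimal sketch of one way to do that is to let the shell render the `Database` block. The function name `render_database_settings` is hypothetical, and the indentation is relative to `SlurmSettings`, so adjust it to match your file.

```bash
# Hypothetical helper: print the Database block with the shell variables
# DATABASE_URI, DATABASE_ADMIN and DATABASE_SECRET_ARN expanded.
render_database_settings() {
  cat <<EOF
Database:
  Uri: ${DATABASE_URI}:3306
  UserName: ${DATABASE_ADMIN}
  PasswordSecretArn: ${DATABASE_SECRET_ARN}
EOF
}

# Usage: print the block, then paste it under SlurmSettings in the
# cluster configuration file.
#   render_database_settings
```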
## Amazon SageMaker HyperPod Orchestrated by Slurm
There are two steps to set up Slurm with the accounting database:
1. Add database configuration file
1. Configure Slurm accounting
### Add database configuration file
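Run the following command on the controller node to configure database connectivity for Slurm. The `DATABASE_URI`, `DATABASE_ADMIN`, and `DATABASE_SECRET_ARN` variables from the previous section must be set in the same shell, because the here-document expands them when the file is written.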
```bash
cat > /opt/slurm/etc/slurmdbd.conf << EOF
AuthType=auth/munge
DbdHost=$(hostname) # Slurm controller hostname.
DbdPort=6819
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
StorageType=accounting_storage/mysql
StorageUser=${DATABASE_ADMIN}
StoragePass=$(aws secretsmanager get-secret-value --secret-id ${DATABASE_SECRET_ARN} --query SecretString --output text)
StorageHost=${DATABASE_URI}
StoragePort=3306
EOF
```
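
The file written above contains the database password in plain text (`StoragePass`), and the Slurm documentation states that `slurmdbd.conf` should be readable only by the user that runs `slurmdbd`. The following sketch tightens the file permissions. The function name `restrict_conf_permissions` is hypothetical, and the owner `slurm` in the usage comment is taken from `SlurmUser=slurm` above.

```bash
# Hypothetical helper: make a configuration file readable and writable by
# its owner only, and optionally change the owner.
restrict_conf_permissions() {
  # $1: path to the configuration file
  # $2: optional owner (changing ownership usually requires root)
  chmod 600 "$1" || return 1
  if [ -n "${2:-}" ]; then
    chown "$2" "$1" || return 1
  fi
}

# Usage on the controller node, run as root:
#   restrict_conf_permissions /opt/slurm/etc/slurmdbd.conf slurm
```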

### Configure Slurm accounting
Append the accounting settings to `slurm.conf` on the controller node:

```bash
cat >> /opt/slurm/etc/slurm.conf << EOF
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=60
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=$(hostname)
AccountingStorageUser=${DATABASE_ADMIN}
AccountingStoragePort=6819
AccountingStorageTRES=gres/gpu
EOF
```

Restart `slurmctld` to pick up the configuration change.
```bash
sudo systemctl restart slurmctld
sudo scontrol reconfigure
```
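
After the restart, a quick sanity check is to ask the accounting database which clusters it knows about. This is a sketch that assumes `sacctmgr` is on the `PATH` of the controller node and that the cluster has been registered in the accounting database. The function name `accounting_cluster_registered` is hypothetical.

```bash
# Hypothetical helper: succeed if the accounting database lists at least
# one cluster, fail otherwise.
accounting_cluster_registered() {
  # --noheader and --parsable2 give one bare cluster name per line.
  sacctmgr --noheader --parsable2 show cluster format=cluster | grep -q .
}

# Usage:
#   accounting_cluster_registered && echo "Slurm accounting is reachable"
```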

For more information on how to use Slurm accounting, see the examples on the [HPC blog](https://aws.amazon.com/blogs/compute/enabling-job-accounting-for-hpc-with-aws-parallelcluster-and-amazon-rds/).

## Delete the database
Once you have deleted your cluster and no longer need to keep the Slurm accounting data, you can delete the database.
Use the command below to delete the AWS CloudFormation stack of the database.
**ALL** accounting data will be deleted.

```bash
aws cloudformation delete-stack --stack-name slurm-accounting-database
```
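
`delete-stack` returns as soon as the deletion is requested. If you want the shell to block until the stack is actually gone, for example in a teardown script, the request can be combined with the CloudFormation waiter. This is a minimal sketch; the function name `delete_stack_and_wait` is hypothetical.

```bash
# Hypothetical helper: request stack deletion, then wait until
# CloudFormation reports that the deletion has completed.
delete_stack_and_wait() {
  # $1: stack name
  aws cloudformation delete-stack --stack-name "$1" &&
    aws cloudformation wait stack-delete-complete --stack-name "$1"
}

# Usage:
#   delete_stack_and_wait slurm-accounting-database
```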
224 changes: 224 additions & 0 deletions 1.architectures/8.accounting-database/cf_database-accounting.yaml
@@ -0,0 +1,224 @@
AWSTemplateFormatVersion: 2010-09-09
Description: >-
  This template creates a database for Slurm Accounting.
Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: "Database Cluster Configuration"
        Parameters:
          - ClusterName
          - ClusterAdmin
          - AdminPasswordSecretString
          - MinCapacity
          - MaxCapacity
    ParameterLabels:
      ClusterName:
        default: "The name of the database cluster"
      ClusterAdmin:
        default: "The database administrator user name."
      AdminPasswordSecretString:
        default: "The administrator password."
      MinCapacity:
        default: "The minimum scaling capacity of the database cluster."
      MaxCapacity:
        default: "The maximum scaling capacity of the database cluster."
      VpcId:
        default: "VPC ID in which the cluster will be created."
      SubnetIds:
        default: "Subnet IDs where database will have connections. Recommend to host in private subnets."

###############
## Parameters
Parameters:
  ClusterName:
    Description: Database Cluster Name
    Type: String
    Default: "slurm-accounting-cluster"
    MinLength: 1
    MaxLength: 63
    AllowedPattern: ^[a-z][-a-z0-9]{0,62}$
    ConstraintDescription: >-
      Cluster name must be between 1 and 63 characters, start with a lower case character, and be followed by a mix of
      lower case characters, digits, and - (hyphens).
  ClusterAdmin:
    Description: Administrator user name.
    Type: String
    Default: clusteradmin
    MinLength: 3
    MaxLength: 64
  MinCapacity:
    Description: Must be less than the maximum capacity.
    Type: Number
    Default: 1
    MinValue: .5
    MaxValue: 127.5
  MaxCapacity:
    Description: Must be greater than or equal to the minimum capacity.
    Type: Number
    Default: 4
    MinValue: 1
    MaxValue: 128
  VpcId:
    Description: VPC ID in which the cluster is deployed.
    Type: AWS::EC2::VPC::Id
  SubnetIds:
    Description: Subnets in which database will be reachable.
    Type: List<AWS::EC2::Subnet::Id>

############################
## Database Resources
Transform: AWS::Serverless-2016-10-31
Resources:
  ###########
  # Generate Password and Store in SecretsManager
  AccountingClusterAdminSecret:
    Type: AWS::SecretsManager::Secret
    Properties:
      Description: 'Serverless Database Cluster Administrator Password'
      GenerateSecretString:
        PasswordLength: 16
        ExcludeCharacters: '#"@/\'
      Tags:
        - Key: 'parallelcluster:usecase'
          Value: 'slurm accounting'

  ###########
  # Database network
  AccountingClusterParameterGroup:
    Type: 'AWS::RDS::DBClusterParameterGroup'
    Properties:
      Description: Cluster parameter group for aurora-mysql
      Family: aurora-mysql8.0
      Parameters:
        require_secure_transport: 'ON'
        innodb_lock_wait_timeout: '900'
      Tags:
        - Key: 'parallelcluster:usecase'
          Value: 'slurm accounting'
  AccountingClusterSubnetGroup:
    Type: 'AWS::RDS::DBSubnetGroup'
    Properties:
      DBSubnetGroupDescription: !Sub 'Subnets for AccountingCluster-${AWS::Region} database'
      SubnetIds: !Ref SubnetIds
      Tags:
        - Key: 'parallelcluster:usecase'
          Value: 'slurm accounting'
  AccountingClusterSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: RDS security group
      SecurityGroupEgress:
        - CidrIp: 0.0.0.0/0
          Description: Allow all outbound traffic by default
          IpProtocol: '-1'
      Tags:
        - Key: 'parallelcluster:usecase'
          Value: 'slurm accounting'
      VpcId: !Ref VpcId
  AccountingClusterSecurityGroupInboundRule:
    Type: 'AWS::EC2::SecurityGroupIngress'
    Properties:
      IpProtocol: tcp
      Description: Allow incoming connections from client security group
      FromPort: !GetAtt
        - AccountingCluster
        - Endpoint.Port
      GroupId: !GetAtt
        - AccountingClusterSecurityGroup
        - GroupId
      SourceSecurityGroupId: !GetAtt
        - AccountingClusterClientSecurityGroup
        - GroupId
      ToPort: !GetAtt
        - AccountingCluster
        - Endpoint.Port

  ###########
  # Database Cluster
  AccountingCluster:
    Type: 'AWS::RDS::DBCluster'
    Properties:
      DBClusterIdentifier: !Ref ClusterName
      Engine: "aurora-mysql"
      EngineVersion: "8.0.mysql_aurora.3.07.1"
      CopyTagsToSnapshot: true
      DBClusterParameterGroupName: !Ref AccountingClusterParameterGroup
      DBSubnetGroupName: !Ref AccountingClusterSubnetGroup
      EnableHttpEndpoint: false
      MasterUsername: !Ref ClusterAdmin
      MasterUserPassword: !Sub "{{resolve:secretsmanager:${AccountingClusterAdminSecret}}}"
      ServerlessV2ScalingConfiguration:
        MaxCapacity: !Ref MaxCapacity
        MinCapacity: !Ref MinCapacity
      StorageEncrypted: true
      Tags:
        - Key: 'parallelcluster:usecase'
          Value: 'slurm accounting'
      VpcSecurityGroupIds:
        - !GetAtt
          - AccountingClusterSecurityGroup
          - GroupId
    UpdateReplacePolicy: Delete
    DeletionPolicy: Delete
  AccountingClusterInstance1:
    Type: 'AWS::RDS::DBInstance'
    Properties:
      DBInstanceClass: 'db.serverless'
      DBClusterIdentifier: !Ref AccountingCluster
      DBInstanceIdentifier: !Sub '${ClusterName}-instance-1'
      Engine: "aurora-mysql"
      PubliclyAccessible: false
    UpdateReplacePolicy: Delete
    DeletionPolicy: Delete
  AccountingClusterClientSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: Security Group to allow connection to Serverless DB Cluster
      Tags:
        - Key: 'parallelcluster:usecase'
          Value: 'slurm accounting'
      VpcId: !Ref VpcId
  AccountingClusterClientSecurityGroupOutboundRule:
    Type: 'AWS::EC2::SecurityGroupEgress'
    Properties:
      GroupId: !GetAtt
        - AccountingClusterClientSecurityGroup
        - GroupId
      IpProtocol: tcp
      Description: Allow outgoing connections from PCluster to the database cluster
      DestinationSecurityGroupId: !GetAtt
        - AccountingClusterSecurityGroup
        - GroupId
      FromPort: !GetAtt
        - AccountingCluster
        - Endpoint.Port
      ToPort: !GetAtt
        - AccountingCluster
        - Endpoint.Port

######################
# Outputs
Outputs:
  ClusterName:
    Value: !Ref ClusterName
  DatabaseHost:
    Value: !GetAtt
      - AccountingCluster
      - Endpoint.Address
  DatabasePort:
    Value: !GetAtt
      - AccountingCluster
      - Endpoint.Port
  DatabaseAdminUser:
    Value: !Ref ClusterAdmin
  DatabaseSecretArn:
    Value: !Ref AccountingClusterAdminSecret
  DatabaseClusterSecurityGroup:
    Value: !GetAtt
      - AccountingClusterSecurityGroup
      - GroupId
  DatabaseClientSecurityGroup:
    Value: !GetAtt
      - AccountingClusterClientSecurityGroup
      - GroupId
