1. Prepare cluster configuration file
   - A `clusterConfig.json` file must be ready before starting the servers. An example is in `clusterConfigExample.json`.
   - Upload this configuration file to the server that will serve as the cluster coordinator (see the sketch after this step).
   - If you use `genConfigForAwsExperiment.py`, this configuration file will be generated and uploaded automatically to all EC2 servers.
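A minimal sketch of the upload step, assuming the runtime is checked out at `~/DeepPoolRuntime` on the coordinator and that you log in as `ubuntu` (the user name and destination path are assumptions, not requirements of the repo):

```sh
# Start from the provided example, edit it for your cluster, then copy it over.
cp clusterConfigExample.json clusterConfig.json
scp clusterConfig.json ubuntu@<coordinator_addr>:~/DeepPoolRuntime/
```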
2. Start cluster via `cluster.py`
   - Single-line command: `python3 cluster.py --addrToBind <this_server's_addr>:<port_to_listen> --c10dBackend nccl` (a filled-in example follows this step).
   - The cluster coordinator will listen on `<this_server's_addr>:<port_to_listen>`. Cluster clients will contact this address and port to submit training jobs. On AWS, make sure that this is a private IP, not a public IP.
   - If you use `genConfigForAwsExperiment.py`, you may copy and paste the last line of its stdout.
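For example, with a private IP of `172.31.0.10` and port `12345` (both placeholder values), the command might look like:

```sh
python3 cluster.py --addrToBind 172.31.0.10:12345 --c10dBackend nccl
```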
3. How to check system logs
   - ssh to the machine that ran `cluster.py`.
   - Logs are in `~/DeepPoolRuntime/logs/`.
   - I typically use `grep "" logs/*.out` and `grep "" logs/*.err` to check how things are going (see the example below).
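The empty-pattern `grep` above simply prints every log line prefixed with its file name. To narrow the output to likely failures (plain grep usage, not a project-specific tool):

```sh
grep -iE "error|exception" ~/DeepPoolRuntime/logs/*.err
```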
4. Submit VGG12 training job to the cluster coordinator
   - ssh to any machine that can reach the cluster coordinator (a reachability check is sketched below).
   - Run `python3 ~/DeepPoolRuntime/examples/vgg.py`.
   - Right now, runtimes will only run 1 iteration.
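If the submission hangs, one quick sanity check (standard tooling, nothing DeepPool-specific) is to verify that the coordinator's address and port are reachable from the client machine:

```sh
# <coordinator_addr> and <port_to_listen> are the values passed to cluster.py.
nc -zv <coordinator_addr> <port_to_listen>
```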
- Scripts
  - `genConfigForAwsExperiment.py`
    - Needs two files: `aws-started-publicDnsName.txt` and `aws-started-privateIps.txt`. They can be generated automatically by `aws_ec2_tools/startEC2instance.sh` (a sketch of their format follows below).
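A sketch of what the two input files might contain; the exact format is whatever `startEC2instance.sh` emits, and the one-entry-per-line layout and values below are assumptions:

```sh
$ cat aws-started-publicDnsName.txt
ec2-3-80-1-1.compute-1.amazonaws.com
ec2-3-80-1-2.compute-1.amazonaws.com
$ cat aws-started-privateIps.txt
172.31.0.11
172.31.0.12
```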
- `aws_ec2_tools`
  - Prerequisites (a CLI sketch follows this list)
    - aws-cli v2 configured for text output mode.
    - A security group that opens all ports within the group.
    - A private key registered in AWS.
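One way to satisfy these prerequisites with stock aws-cli v2 commands; the group and key names below are placeholders, and this is not a script shipped with the repo:

```sh
# Default to text output mode.
aws configure set output text
# Create a security group and open all ports between members of the group.
aws ec2 create-security-group --group-name deeppool-sg --description "DeepPool cluster"
aws ec2 authorize-security-group-ingress --group-name deeppool-sg \
    --protocol all --source-group deeppool-sg
# Register a key pair in AWS and keep the private key locally.
aws ec2 create-key-pair --key-name deeppool-key \
    --query 'KeyMaterial' --output text > deeppool-key.pem
chmod 400 deeppool-key.pem
```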
- BE training
  - BE training requires a patched PyTorch. Run `build_custom_pytorch.sh` in the `be_training` directory to download, patch, compile, and install PyTorch.
  - Build and install the training extension by running `python setup.py install` in the same directory.
  - Control BE training using the `--be_batch_size=N` flag for `runtime.py` (0 disables BE training; 16 is the default). The combined steps are sketched below.
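Putting the three steps together, starting from the repo root (a sketch; any other flags `runtime.py` needs are omitted, since only `--be_batch_size` is documented here):

```sh
cd be_training
./build_custom_pytorch.sh    # download, patch, compile, and install PyTorch
python setup.py install      # build and install the BE training extension
cd ..
python3 runtime.py --be_batch_size=16    # 0 would disable BE training
```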