DeepPool

Instructions on how to run the VGG example

1. Prepare cluster configuration file

  • A clusterConfig.json file must be prepared before the servers can start. An example is provided in clusterConfigExample.json (see the sketch after this list).
  • Upload this configuration file to the server that will serve as the cluster coordinator.
  • If you use genConfigForAwsExperiment.py, this configuration file is generated and uploaded automatically to all EC2 servers.
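A minimal sketch of preparing and uploading the file is below. The JSON field names (servers, addr, port, gpus) and the destination path are hypothetical placeholders, not the actual format; consult clusterConfigExample.json for the real schema.

```bash
# Hypothetical sketch only: field names below are placeholders.
# See clusterConfigExample.json for the actual schema.
cat > clusterConfig.json <<'EOF'
{
  "servers": [
    { "addr": "172.31.0.10", "port": 1234, "gpus": 1 },
    { "addr": "172.31.0.11", "port": 1234, "gpus": 1 }
  ]
}
EOF
# Copy the file to the coordinator node (destination path is an assumption).
scp clusterConfig.json ubuntu@172.31.0.10:~/DeepPoolRuntime/
```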

2. Start cluster via cluster.py

  • Single-line command (a concrete example follows this list):
    • python3 cluster.py --addrToBind <this_server's_addr>:<port_to_listen> --c10dBackend nccl
      • The cluster coordinator will listen on <this_server's_addr>:<port_to_listen>. Cluster clients will contact this address and port to submit training jobs. On AWS, make sure this is a private IP, not a public IP.
    • If you use genConfigForAwsExperiment.py, you can simply copy and paste the last line of its stdout.
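For example, on an AWS node whose private IP is 172.31.0.10 (a placeholder; the port is an arbitrary free one), the coordinator could be started as:

```bash
# Start the cluster coordinator; it listens on the given address:port.
# On AWS this must be the instance's *private* IP, not its public IP.
python3 cluster.py --addrToBind 172.31.0.10:12345 --c10dBackend nccl
```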

3. How to check system logs

  • SSH into the machine that ran cluster.py.
  • Logs are written to ~/DeepPoolRuntime/logs/.
  • I typically use grep "" logs/*.out and grep "" logs/*.err to check how things are going (see the example after this list).
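For instance (log file names are whatever the runtimes produced):

```bash
cd ~/DeepPoolRuntime
grep "" logs/*.out    # dumps every stdout log line, prefixed with its file name
grep "" logs/*.err    # same for stderr; handy for spotting exceptions
tail -f logs/*.err    # optionally, follow the error logs live
```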

4. Submit VGG12 training job to the cluster coordinator

  • SSH into any machine that can reach the cluster coordinator.
  • Run python3 ~/DeepPoolRuntime/examples/vgg.py (see the example after this list).
  • Right now, runtimes only run one iteration.
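Putting the step together, submission is a single command from any client machine that can reach the coordinator's address and port:

```bash
# Submits the VGG12 training job to the cluster coordinator.
python3 ~/DeepPoolRuntime/examples/vgg.py
```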

Instructions for the automatic AWS setup scripts

  • Scripts
    • genConfigForAwsExperiment.py
      • genConfigForAwsExperiment.py needs two files: aws-started-publicDnsName.txt and aws-started-privateIps.txt. They can be generated automatically by aws_ec2_tools/startEC2instance.sh (see the workflow sketch after this list).
    • aws_ec2_tools
      • Prerequisites
        • aws-cli v2, configured for text output mode.
        • A security group that opens all ports within the group.
        • A private key registered in AWS.
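The assumed end-to-end flow is sketched below. Neither script's arguments are documented in this README, so both invocations are shown bare:

```bash
# 1. Launch EC2 instances; this is expected to write
#    aws-started-publicDnsName.txt and aws-started-privateIps.txt.
./aws_ec2_tools/startEC2instance.sh

# 2. Generate clusterConfig.json from those files and upload it to all
#    EC2 servers; the last stdout line is the cluster.py command to run.
python3 genConfigForAwsExperiment.py
```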

Enabling best-effort training (single-GPU ResNet-50)

  • BE training requires a patched PyTorch. Run build_custom_pytorch.sh in the be_training directory to download, patch, compile, and install PyTorch.
  • Build and install the training extension by running python setup.py install in the same directory.
  • Control BE training using the --be_batch_size=N flag for runtime.py (0 disables BE training; the default is 16). See the example after this list.
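A condensed version of the setup is below. How runtime.py is normally launched is not covered in this README, so the last line is illustrative only; the flag itself is the one described above.

```bash
cd be_training
./build_custom_pytorch.sh    # download, patch, compile, and install patched PyTorch
python setup.py install      # build and install the BE training extension

# Illustrative only: pass the flag wherever runtime.py is launched.
# 0 disables BE training; the default batch size is 16.
python3 runtime.py --be_batch_size=16
```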