seojinpark/DeepPool

DeepPool

Instructions on how to run the VGG example

1. Prepare cluster configuration file

  • A clusterConfig.json file must be ready before you start the servers. An example is in clusterConfigExample.json.
  • Upload this configuration file to the server that will serve as the cluster coordinator.
  • If you use genConfigForAwsExperiment.py, this configuration file will be generated & uploaded automatically to all EC2 servers.

2. Start cluster via cluster.py

  • Single line command
    • python3 cluster.py --addrToBind <this_server's_addr>:<port_to_listen> --c10dBackend nccl
      • The cluster coordinator will listen on <this_server's_addr>:<port_to_listen>. Cluster clients will contact this address and port to submit training jobs. On AWS, make sure that this is a private IP, not a public IP.
    • If you use genConfigForAwsExperiment.py, you may copy and paste the last line of stdout.
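
Since the bind address must be a private IP on AWS, it can help to check it before launching. The following is a minimal sketch, not part of DeepPool: build_cluster_command is a hypothetical helper, and the address 172.31.5.10 and port 12345 are placeholder values. It only assembles the command line shown above and refuses an address that Python's ipaddress module does not classify as private.

```python
import ipaddress
import shlex


def build_cluster_command(addr: str, port: int, backend: str = "nccl") -> str:
    """Return the cluster.py command line for a private bind address.

    Hypothetical helper for illustration; only the --addrToBind and
    --c10dBackend flags come from the instructions above.
    """
    ip = ipaddress.ip_address(addr)  # raises ValueError if addr is not an IP literal
    if not ip.is_private:
        raise ValueError(f"{addr} is not a private address")
    if not 0 < port < 65536:
        raise ValueError(f"port {port} is out of range")
    args = [
        "python3", "cluster.py",
        "--addrToBind", f"{addr}:{port}",
        "--c10dBackend", backend,
    ]
    return shlex.join(args)


# Placeholder address and port, chosen only for the example.
print(build_cluster_command("172.31.5.10", 12345))
```

Note that is_private covers more than the RFC 1918 ranges (loopback addresses also pass), so this check catches an accidental public IP but does not prove the address is reachable by the other servers.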

3. How to check system logs

  • ssh to the machine that ran cluster.py
  • Logs are in ~/DeepPoolRuntime/logs/
  • I typically use grep "" logs/*.out and grep "" logs/*.err to check how things are going.
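
The grep "" idiom above prints every line of every log file, prefixed with the file name. A rough Python equivalent is sketched below for cases where you want to post-process the output; dump_logs is a hypothetical helper, and the only things taken from the instructions are the log directory and the .out/.err suffixes.

```python
import glob
import os
from typing import List


def dump_logs(log_dir: str, patterns=("*.out", "*.err")) -> List[str]:
    """Return every line of the matching log files as 'path:line' strings."""
    lines = []
    for pattern in patterns:
        # Sort so the output order is stable across runs.
        for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
            with open(path, errors="replace") as f:
                for line in f:
                    lines.append(f"{path}:{line.rstrip(os.linesep)}")
    return lines


if __name__ == "__main__":
    # Prints nothing if the directory does not exist or holds no logs.
    for entry in dump_logs(os.path.expanduser("~/DeepPoolRuntime/logs")):
        print(entry)
```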

4. Submit VGG12 training job to the cluster coordinator

  • ssh to any machine that can reach the cluster coordinator.
  • Run python3 ~/DeepPoolRuntime/examples/vgg.py
  • Right now, runtimes will only run 1 iteration.

Instructions on automatic setup scripts for AWS

  • Scripts
    • genConfigForAwsExperiment.py
      • genConfigForAwsExperiment.py needs two files: aws-started-publicDnsName.txt and aws-started-privateIps.txt. They can be generated automatically by aws_ec2_tools/startEC2instance.sh.
    • aws_ec2_tools
      • Prerequisites
        • aws-cli2 (AWS CLI version 2) configured with text output mode.
        • Set up a security group that opens all ports within the group.
        • A private key registered in AWS.
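
Before running genConfigForAwsExperiment.py, it can be useful to confirm that the two input files line up. The sketch below is an illustration only: pair_hosts is a hypothetical helper, and it assumes (this is not confirmed by the repository) that each file holds whitespace-separated entries listed in the same instance order.

```python
from pathlib import Path
from typing import List, Tuple


def read_entries(path: str) -> List[str]:
    """Split a file on any whitespace (assumed format: tabs or newlines)."""
    return Path(path).read_text().split()


def pair_hosts(dns_file: str, ip_file: str) -> List[Tuple[str, str]]:
    """Pair each public DNS name with the private IP at the same position."""
    dns_names = read_entries(dns_file)
    private_ips = read_entries(ip_file)
    if len(dns_names) != len(private_ips):
        raise ValueError(
            f"{len(dns_names)} DNS names but {len(private_ips)} private IPs"
        )
    return list(zip(dns_names, private_ips))


if __name__ == "__main__":
    # File names as given in the instructions above.
    for dns, ip in pair_hosts("aws-started-publicDnsName.txt",
                              "aws-started-privateIps.txt"):
        print(dns, ip)
```

A count mismatch usually means one of the files is stale, so failing early is preferable to generating a configuration with misaligned hosts.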

Enabling best-effort training (Single GPU Resnet50)

  • BE training requires a patched pytorch. Run build_custom_pytorch.sh in the be_training directory to download, patch, compile, and install pytorch.
  • Build and install the training extension by running python setup.py install in the same directory.
  • Control the BE training batch size with the --be_batch_size=N flag for runtime.py (0 disables BE training; the default is 16).
