1. Prepare cluster configuration file
   - A `clusterConfig.json` file must be ready before starting the servers. An example is in `clusterConfigExample.json`.
   - Upload this configuration file to the server that will serve as the cluster coordinator (see the sketch after this step).
   - If you use `genConfigForAwsExperiment.py`, this configuration file will be generated and uploaded automatically to all EC2 servers.
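A minimal sketch of the upload step, assuming the runtime is checked out at `~/DeepPoolRuntime` on the coordinator and that you log in as `ubuntu` (the user name and destination path are assumptions, not requirements of the repo):

```sh
# Start from the provided example, edit it for your cluster, then copy it over.
cp clusterConfigExample.json clusterConfig.json
scp clusterConfig.json ubuntu@<coordinator_addr>:~/DeepPoolRuntime/
```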
2. Start cluster via `cluster.py`
   - Single-line command: `python3 cluster.py --addrToBind <this_server's_addr>:<port_to_listen> --c10dBackend nccl` (a filled-in example follows this step).
   - The cluster coordinator will listen on `<this_server's_addr>:<port_to_listen>`. Cluster clients will contact this address and port to submit training jobs. On AWS, make sure that this is a private IP, not a public IP.
   - If you use `genConfigForAwsExperiment.py`, you may copy and paste the last line of its stdout.
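For example, with a private IP of `172.31.0.10` and port `12345` (both placeholder values), the command might look like:

```sh
python3 cluster.py --addrToBind 172.31.0.10:12345 --c10dBackend nccl
```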
3. How to check system logs
   - ssh to the machine that ran `cluster.py`.
   - Logs are in `~/DeepPoolRuntime/logs/`.
   - I typically use `grep "" logs/*.out` and `grep "" logs/*.err` to check how things are going (see the example below).
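The empty-pattern `grep` above simply prints every log line prefixed with its file name. To narrow the output to likely failures (plain grep usage, not a project-specific tool):

```sh
grep -iE "error|exception" ~/DeepPoolRuntime/logs/*.err
```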
4. Submit VGG12 training job to the cluster coordinator
   - ssh to any machine that can reach the cluster coordinator (a reachability check is sketched below).
   - Run `python3 ~/DeepPoolRuntime/examples/vgg.py`.
   - Right now, runtimes will only run 1 iteration.
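If the submission hangs, one quick sanity check (standard tooling, nothing DeepPool-specific) is to verify that the coordinator's address and port are reachable from the client machine:

```sh
# <coordinator_addr> and <port_to_listen> are the values passed to cluster.py.
nc -zv <coordinator_addr> <port_to_listen>
```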
- Scripts
  - `genConfigForAwsExperiment.py`
    - Needs two files: `aws-started-publicDnsName.txt` and `aws-started-privateIps.txt`. They can be generated automatically by `aws_ec2_tools/startEC2instance.sh` (a sketch of their format follows below).
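A sketch of what the two input files might contain; the exact format is whatever `startEC2instance.sh` emits, and the one-entry-per-line layout and values below are assumptions:

```sh
$ cat aws-started-publicDnsName.txt
ec2-3-80-1-1.compute-1.amazonaws.com
ec2-3-80-1-2.compute-1.amazonaws.com
$ cat aws-started-privateIps.txt
172.31.0.11
172.31.0.12
```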
- `aws_ec2_tools`
  - Prerequisites (a CLI sketch follows this list)
    - aws-cli v2 configured for text output mode.
    - A security group that opens all ports within the group.
    - A private key registered in AWS.
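One way to satisfy these prerequisites with stock aws-cli v2 commands; the group and key names below are placeholders, and this is not a script shipped with the repo:

```sh
# Default to text output mode.
aws configure set output text
# Create a security group and open all ports between members of the group.
aws ec2 create-security-group --group-name deeppool-sg --description "DeepPool cluster"
aws ec2 authorize-security-group-ingress --group-name deeppool-sg \
    --protocol all --source-group deeppool-sg
# Register a key pair in AWS and keep the private key locally.
aws ec2 create-key-pair --key-name deeppool-key \
    --query 'KeyMaterial' --output text > deeppool-key.pem
chmod 400 deeppool-key.pem
```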
- BE training
  - BE training requires a patched PyTorch. Run `build_custom_pytorch.sh` in the `be_training` directory to download, patch, compile, and install PyTorch.
  - Build and install the training extension by running `python setup.py install` in the same directory.
  - Control BE training using the `--be_batch_size=N` flag for `runtime.py` (0 disables BE training; 16 is the default). The combined steps are sketched below.
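Putting the three steps together, starting from the repo root (a sketch; any other flags `runtime.py` needs are omitted, since only `--be_batch_size` is documented here):

```sh
cd be_training
./build_custom_pytorch.sh    # download, patch, compile, and install PyTorch
python setup.py install      # build and install the BE training extension
cd ..
python3 runtime.py --be_batch_size=16    # 0 would disable BE training
```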