Running on the HPCC
Jessie Micallef edited this page May 22, 2020 · 1 revision
To create the processing scripts, you will need IceTray access in the IceCube software to convert the i3 formatted files. The rest of the components use hdf5 files, so only a python environment is needed for the remaining parts of processing. To train the network, Keras with a Tensorflow backend is needed. Tensorflow suggests using anaconda to install, though anaconda does not play nice with the IceCube metaprojects. Since the processing steps are separated, you can load a different environment for each step.
- Create Training scripts (i3 --> hdf5)
  - Need IceCube software!
  - Option 1: cvmfs

    ```
    source /mnt/home/micall12/setup_combo_stable.sh
    ```

    which does the following steps:

    ```
    eval /cvmfs/icecube.opensciencegrid.org/py3-v4.1.0/setup.sh
    module purge
    /cvmfs/icecube.opensciencegrid.org/py3-v4.1.0/RHEL_7_x86_64/metaprojects/combo/stable/env-shell.sh
    ```
  - Option 2: singularity container

    ```
    singularity exec -B /mnt/home/micall12:/mnt/home/micall12 -B /mnt/research/IceCube:/mnt/research/IceCube --nv /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif python ...
    ```

    - Must replace `micall12` with your own netID
    - Can also start a singularity shell, then run scripts interactively inside
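Since the same long `singularity exec` line is needed for every script, it can help to wrap it once. The function below is only a sketch, not part of this repo: the bind paths and image path come from the command above, while the function name and the `DRY_RUN` flag are made up here for illustration.

```shell
# Hypothetical wrapper around the singularity command shown above.
# Usage: run_in_container NETID command...   (e.g. python your_script.py)
run_in_container() {
    local user=$1; shift
    local cmd=(singularity exec \
        -B "/mnt/home/$user:/mnt/home/$user" \
        -B /mnt/research/IceCube:/mnt/research/IceCube \
        --nv \
        /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif \
        "$@")
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # print the composed command instead of running it
        printf '%s ' "${cmd[@]}"; echo
    else
        "${cmd[@]}"
    fi
}
```

Setting `DRY_RUN=1` lets you inspect the full command before submitting it in a job.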
- CNN Training and Testing:
  - Need tensorflow, keras, and python!
  - Option 1: singularity container

    ```
    singularity exec -B /mnt/home/micall12:/mnt/home/micall12 -B /mnt/research/IceCube:/mnt/research/IceCube --nv /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif python ...
    ```

    - Must replace `micall12` with your own netID
    - Can also start a singularity shell, then run scripts interactively inside
    - Advantage of this option: the container can be sent to any cluster along with the code and should run there as-is
    - Disadvantage of this option: the container is static, making it difficult to update the software inside
  - Option 2: anaconda
    - Install anaconda
    - Create a virtual env:

      ```
      conda create -n tfgpu
      ```

    - Go into the virtual env:

      ```
      conda activate tfgpu
      ```

    - Add the necessary libraries:

      ```
      pip install tensorflow
      pip install keras
      pip install matplotlib
      ```

    - Advantage of this option: easier to update, not a "static container"
    - Disadvantage of this option: Tensorflow's GPU interaction has been known to stop working suddenly on the HPCC, and the only solution found so far is to reinstall anaconda and then recreate the virtual env
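Because the GPU interaction can break silently, it is worth checking it before a long training run. One quick check (an assumption on my part, not a step from the original workflow) is to ask TensorFlow which GPUs it sees from inside the activated `tfgpu` env:

```shell
# Inside the activated tfgpu env: prints the list of GPUs TensorFlow sees.
# An empty list [] means the GPU interaction has broken and the
# reinstall-anaconda fix described above is likely needed.
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

`tf.config.list_physical_devices` requires TensorFlow 2.1 or newer; on older versions, `tf.test.is_gpu_available()` gives a similar yes/no answer.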
- Example job submission scripts are in `make_jobs`
- Create HDF5: Making the training files (i3 --> hdf5)
  - Most efficient to run in parallel
    - Can glob, but the concat step takes a while
    - Each file only takes a few minutes
  - `create_job_files_single_training.sh` makes a job script for every file in the specified folder
  - `job_template...` should have all the flags/args you want for the `create_training` code
  - You can submit all these as jobs or run them locally with bash
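The job-generation step above can be pictured as a loop over the input folder. This is only a sketch of what `create_job_files_single_training.sh` might do — the `@@FILE@@` placeholder token, the function name, and the argument order are all inventions here, so check the actual script in `make_jobs` for the real flags:

```shell
# Hypothetical sketch: write one job script per input i3 file by filling
# a placeholder in a template (the real repo script may differ).
make_job_scripts() {
    local input_dir=$1 template=$2 out_dir=$3
    local f name
    mkdir -p "$out_dir"
    for f in "$input_dir"/*.i3*; do
        [ -e "$f" ] || continue            # skip if the glob matched nothing
        name=$(basename "$f")
        # substitute the input path into the template for this file
        sed "s|@@FILE@@|$f|g" "$template" > "$out_dir/job_${name}.sh"
    done
}
```

Each generated script can then be submitted as its own job, or run locally with bash, as noted above.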
- Run CNN: Training the CNN
  - Use the `singularity` container to run the CNN
  - Kill and restart the tensorflow script every handful of epochs (otherwise a memory leak adds ~2 min per epoch)
  - STEPS should correspond to the number of files in your data set
  - Assumes there is a folder called `output_plots` in your main directory
  - Should request a GPU and about 27G of memory
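The kill-and-restart workaround can be automated with a small driver loop that launches each chunk of epochs in a fresh process, so the leaked memory is returned to the OS between chunks. This assumes your training script can save a checkpoint and resume from a given starting epoch; the function below is an illustration, not code from this repo:

```shell
# Hypothetical driver for the kill-and-restart pattern.
# Usage: run_in_chunks TOTAL_EPOCHS EPOCHS_PER_RUN command...
# The command is invoked as: command START_EPOCH NUM_EPOCHS
run_in_chunks() {
    local total=$1 chunk=$2
    shift 2
    local start=0
    while [ "$start" -lt "$total" ]; do
        # each chunk runs in a fresh process, then exits before the next starts
        "$@" "$start" "$chunk"
        start=$((start + chunk))
    done
}

# Example invocation (placeholder script name and flags):
# run_in_chunks 30 3 run_in_container micall12 python train_cnn.py --start_epoch
```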
FLERCNN by J. Micallef