Running on the HPCC
Jessie Micallef edited this page May 22, 2020 · 1 revision
To create the processing scripts, you will need IceTray access in the IceCube software to convert the i3 formatted files. The rest of the components use hdf5 files, so only a python environment is needed for the remaining parts of processing. To train the network, Keras with a Tensorflow backend is needed. Tensorflow suggests using anaconda to install, though anaconda does not play nice with the IceCube metaprojects. Since the processing steps are separated, you can load a different environment for each step.
- Create Training scripts (i3 --> hdf5)
  - Need IceCube software!
  - Option 1: cvmfs

    ```
    source /mnt/home/micall12/setup_combo_stable.sh
    ```

    which does the following steps:

    ```
    eval /cvmfs/icecube.opensciencegrid.org/py3-v4.1.0/setup.sh
    module purge
    /cvmfs/icecube.opensciencegrid.org/py3-v4.1.0/RHEL_7_x86_64/metaprojects/combo/stable/env-shell.sh
    ```
  - Option 2: singularity container

    ```
    singularity exec -B /mnt/home/micall12:/mnt/home/micall12 -B /mnt/research/IceCube:/mnt/research/IceCube --nv /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif python ...
    ```

    - Must replace `micall12` with your own netID
    - Can also start a singularity shell, then run scripts interactively inside
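Since the same long `singularity exec` line is needed for every script, it can help to wrap it once. The function below is only a sketch, not part of this repo: the bind paths and image path come from the command above, while the function name and the `DRY_RUN` flag are made up here for illustration.

```shell
# Hypothetical wrapper around the singularity command shown above.
# Usage: run_in_container NETID command...   (e.g. python your_script.py)
run_in_container() {
    local user=$1; shift
    local cmd=(singularity exec \
        -B "/mnt/home/$user:/mnt/home/$user" \
        -B /mnt/research/IceCube:/mnt/research/IceCube \
        --nv \
        /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif \
        "$@")
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # print the composed command instead of running it
        printf '%s ' "${cmd[@]}"; echo
    else
        "${cmd[@]}"
    fi
}
```

Setting `DRY_RUN=1` lets you inspect the full command before submitting it in a job.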
- CNN Training and Testing:
  - Need tensorflow, keras, and python!
  - Option 1: singularity container

    ```
    singularity exec -B /mnt/home/micall12:/mnt/home/micall12 -B /mnt/research/IceCube:/mnt/research/IceCube --nv /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif python ...
    ```

    - Must replace `micall12` with your own netID
    - Can also start a singularity shell, then run scripts interactively inside
    - Advantage of this option: the container can be sent to any cluster along with the code and should run there as-is
    - Disadvantage of this option: the container is static, making it difficult to update the software inside
  - Option 2: anaconda
    - Install anaconda
    - Create a virtual env:

      ```
      conda create -n tfgpu
      ```

    - Go into the virtual env:

      ```
      conda activate tfgpu
      ```

    - Add the necessary libraries:

      ```
      pip install tensorflow
      pip install keras
      pip install matplotlib
      ```

    - Advantage of this option: easier to update, not a "static container"
    - Disadvantage of this option: Tensorflow's GPU interaction has been known to stop working suddenly on the HPCC, and the only solution found so far is to reinstall anaconda and then recreate the virtual env
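Because the GPU interaction can break silently, it is worth checking it before a long training run. One quick check (an assumption on my part, not a step from the original workflow) is to ask TensorFlow which GPUs it sees from inside the activated `tfgpu` env:

```shell
# Inside the activated tfgpu env: prints the list of GPUs TensorFlow sees.
# An empty list [] means the GPU interaction has broken and the
# reinstall-anaconda fix described above is likely needed.
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

`tf.config.list_physical_devices` requires TensorFlow 2.1 or newer; on older versions, `tf.test.is_gpu_available()` gives a similar yes/no answer.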
- Example job submission scripts are in `make_jobs`
- Create HDF5: Making the training files (i3 --> hdf5)
  - Most efficient to run in parallel
    - Can glob, but the concat step takes a while
    - Each file only takes a few minutes
  - `create_job_files_single_training.sh` makes a job script for every file in the specified folder
  - `job_template...` should have all the flags/args you want for the `create_training` code
  - You can submit all these as jobs or run them locally with bash
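The job-generation step above can be pictured as a loop over the input folder. This is only a sketch of what `create_job_files_single_training.sh` might do — the `@@FILE@@` placeholder token, the function name, and the argument order are all inventions here, so check the actual script in `make_jobs` for the real flags:

```shell
# Hypothetical sketch: write one job script per input i3 file by filling
# a placeholder in a template (the real repo script may differ).
make_job_scripts() {
    local input_dir=$1 template=$2 out_dir=$3
    local f name
    mkdir -p "$out_dir"
    for f in "$input_dir"/*.i3*; do
        [ -e "$f" ] || continue            # skip if the glob matched nothing
        name=$(basename "$f")
        # substitute the input path into the template for this file
        sed "s|@@FILE@@|$f|g" "$template" > "$out_dir/job_${name}.sh"
    done
}
```

Each generated script can then be submitted as its own job, or run locally with bash, as noted above.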
- Run CNN: Training the CNN
  - Use the `singularity` container to run the CNN
  - Kill and restart the tensorflow script every handful of epochs (otherwise a memory leak adds ~2 min per epoch)
  - STEPS should correspond to the number of files in your data set
  - Assumes there is a folder called `output_plots` in your main directory
  - Should request a GPU and about 27G of memory
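The kill-and-restart workaround can be automated with a small driver loop that launches each chunk of epochs in a fresh process, so the leaked memory is returned to the OS between chunks. This assumes your training script can save a checkpoint and resume from a given starting epoch; the function below is an illustration, not code from this repo:

```shell
# Hypothetical driver for the kill-and-restart pattern.
# Usage: run_in_chunks TOTAL_EPOCHS EPOCHS_PER_RUN command...
# The command is invoked as: command START_EPOCH NUM_EPOCHS
run_in_chunks() {
    local total=$1 chunk=$2
    shift 2
    local start=0
    while [ "$start" -lt "$total" ]; do
        # each chunk runs in a fresh process, then exits before the next starts
        "$@" "$start" "$chunk"
        start=$((start + chunk))
    done
}

# Example invocation (placeholder script name and flags):
# run_in_chunks 30 3 run_in_container micall12 python train_cnn.py --start_epoch
```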
FLERCNN by J. Micallef