
Running on the HPCC


Environments

To create the training files, you will need IceTray access through the IceCube software in order to convert the i3-formatted files. The rest of the pipeline uses hdf5 files, so only a Python environment is needed for the remaining processing steps. To train the network, Keras with a TensorFlow backend is needed. TensorFlow suggests installing through Anaconda, but Anaconda does not play nicely with the IceCube metaprojects. Since the processing steps are separated, you can load a different environment for each step.

  • Create Training scripts (i3 --> hdf5)
    • Need IceCube software!
    • Option 1: cvmfs
      • source /mnt/home/micall12/setup_combo_stable.sh, which performs the following steps (collected into a sketch after this list):
        • eval $(/cvmfs/icecube.opensciencegrid.org/py3-v4.1.0/setup.sh)
        • module purge
        • /cvmfs/icecube.opensciencegrid.org/py3-v4.1.0/RHEL_7_x86_64/metaprojects/combo/stable/env-shell.sh
    • Option 2: singularity container
      • singularity exec -B /mnt/home/micall12:/mnt/home/micall12 -B /mnt/research/IceCube:/mnt/research/IceCube --nv /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif python ...
      • Replace micall12 with your own NetID
      • Can also start a singularity shell and then run scripts interactively inside the container (a concrete example appears after this list)
  • CNN Training and Testing:
    • Need tensorflow, keras, and python!
    • Option 1: singularity container
      • singularity exec -B /mnt/home/micall12:/mnt/home/micall12 -B /mnt/research/IceCube:/mnt/research/IceCube --nv /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif python ...
      • Replace micall12 with your own NetID
      • Can also start a singularity shell and then run scripts interactively inside the container (see the example after this list)
      • Advantage of this option: the container can be copied to any cluster along with the code and should run as-is
      • Disadvantage of this option: the container is static, so the software inside it is difficult to update
    • Option 2: anaconda (the full command sequence is sketched after this list)
      • Install anaconda
      • Create a virtual environment: conda create -n tfgpu
      • Enter the virtual environment: conda activate tfgpu
      • Add necessary libraries:
        • pip install tensorflow
        • pip install keras
        • pip install matplotlib
      • Advantage of this option: easier to update, since it is not a static container
      • Disadvantage of this option: TensorFlow's GPU interaction has been known to stop working suddenly on the HPCC, and the only fix found so far is to reinstall Anaconda and then recreate the virtual environment
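
For reference, here is a minimal sketch of what a setup script like setup_combo_stable.sh contains, simply collecting the three cvmfs steps listed above into one file:

```bash
#!/bin/bash
# Sketch of setup_combo_stable.sh (cvmfs option) -- the three steps from above.

# Load the IceCube cvmfs py3-v4.1.0 toolchain into the current shell.
eval $(/cvmfs/icecube.opensciencegrid.org/py3-v4.1.0/setup.sh)

# Clear any loaded HPCC modules that could conflict with the cvmfs toolchain.
module purge

# Drop into the combo/stable metaproject environment shell.
/cvmfs/icecube.opensciencegrid.org/py3-v4.1.0/RHEL_7_x86_64/metaprojects/combo/stable/env-shell.sh
```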
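
A concrete example of the singularity invocation from the container option; your_netid is a placeholder for your own NetID, and create_single_training.py with its flags is a hypothetical name standing in for whatever script you actually run after python:

```bash
# Run a script inside the icetray+tensorflow container
# (hypothetical script name and flags).
singularity exec \
    -B /mnt/home/your_netid:/mnt/home/your_netid \
    -B /mnt/research/IceCube:/mnt/research/IceCube \
    --nv \
    /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif \
    python create_single_training.py -i input.i3 -o output.hdf5

# Or open an interactive shell inside the container and run scripts from there:
singularity shell \
    -B /mnt/home/your_netid:/mnt/home/your_netid \
    -B /mnt/research/IceCube:/mnt/research/IceCube \
    --nv \
    /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif
```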
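
The Anaconda option as one command sequence, nothing beyond the steps listed above; this assumes Anaconda itself is already installed and on your PATH:

```bash
# One-time setup of the tfgpu virtual environment (Option 2).
conda create -n tfgpu      # create the virtual environment
conda activate tfgpu       # enter the virtual environment

# Add the necessary libraries inside the environment.
pip install tensorflow
pip install keras
pip install matplotlib
```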

Submitting Jobs

  • Example job submission scripts in make_jobs
  • Create HDF5: Making the training files (i3-->hdf5)
    • Most efficient to run in parallel
      • Can glob many files into one job, but the concatenation step takes a while
      • Each file takes only a few minutes to process
    • create_job_files_single_training.sh makes a job script for every file in the specified folder (a sketch of what such a job script looks like appears after this list)
    • job_template... should have all the flags/args you want passed to the create_training code
    • You can submit all these as jobs or run them locally with bash
  • Run CNN: Training the CNN
    • Use the singularity container to run the CNN (an example GPU job script is sketched after this list)
    • Kill and relaunch the TensorFlow script every handful of epochs (otherwise a memory leak adds ~2 minutes per epoch)
      • STEPS should correspond to the number of files in your data set
    • Assumes there is a folder called output_plots in your main directory
    • Should request a GPU and about 27 GB of memory
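
The HPCC uses SLURM, so a generated job script for one i3 --> hdf5 conversion might look roughly like the sketch below. All of the #SBATCH values, the your_netid placeholder, and the conversion script name and flags are illustrative assumptions, not the contents of the actual job_template:

```bash
#!/bin/bash
#SBATCH --job-name=create_hdf5     # illustrative resource requests:
#SBATCH --time=00:30:00            # each file only takes a few minutes
#SBATCH --mem=4G
#SBATCH --output=create_hdf5_%j.out

# Convert one i3 file to hdf5 inside the icetray container.
# Input/output files are passed as arguments; the script name and its
# flags are hypothetical placeholders.
singularity exec \
    -B /mnt/home/your_netid:/mnt/home/your_netid \
    -B /mnt/research/IceCube:/mnt/research/IceCube \
    /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif \
    python create_single_training.py -i "$1" -o "$2"
```

As noted above, each generated script can either be submitted with sbatch or run locally with bash.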
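
For the CNN training job, here is a sketch of the kill-and-relaunch pattern described above. The #SBATCH lines reflect the GPU and ~27 GB requests; CNN_train.py and its --step flag are hypothetical placeholders for the actual training script and its arguments:

```bash
#!/bin/bash
#SBATCH --job-name=train_cnn
#SBATCH --gres=gpu:1               # request a GPU
#SBATCH --mem=27G                  # ~27 GB of memory, as noted above
#SBATCH --time=24:00:00
#SBATCH --output=train_cnn_%j.out

# Training code assumes this folder exists in your main directory.
mkdir -p output_plots

# Per the notes above, STEPS should correspond to the number of files
# in your data set.
STEPS=100

for ((step=0; step<STEPS; step++)); do
    # Each invocation trains for a handful of epochs, saves its state, and
    # exits; relaunching works around the memory leak that otherwise adds
    # ~2 minutes per epoch.
    singularity exec \
        -B /mnt/home/your_netid:/mnt/home/your_netid \
        -B /mnt/research/IceCube:/mnt/research/IceCube \
        --nv \
        /mnt/research/IceCube/Software/icetray_stable-tensorflow.sif \
        python CNN_train.py --step "$step"
done
```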