Training and Testing Scripts

Jessie Micallef edited this page May 22, 2020 · 1 revision

Typical CNN Training/Testing Procedure

  • Submit CNN_LoadMultipleFiles.py to train for many epochs (100s)
  • Check progress using plot_loss_from_column.py during training
  • Use CNN_TestOnly.py to check results at any time (using the oscnext test)
    • When to check results: when the validation curve is leveling off, or when you want results at a specific epoch
    • Save PegLeg or Retro test once you have settled on final model

Description of Scripts

  • CNN_LoadMultipleFiles.py - used for training the CNN

    • Takes in multiple training files (matching a given file pattern); loads one file and trains on it for an epoch, then loads the next file for the next epoch
      • Makes sure not too much data is stored at once (~30G)
      • Shuffles within each file between full passes through the set of files
      • Expects train, test, and validation sets in the data it loads
    • Learning rate adjustable with parser args
    • Batch size and dropout currently constant
    • Loss functions:
      • Energy = mean_absolute_percentage_error
      • Zenith = mean_squared_error
      • Track Length = mean_squared_error (NOT optimized)
    • Loads model architecture from cnn_model.py
    • Functionality:
      • Can train for energy or zenith alone
        • parser arg option --variables 1 and --first_variable "zenith" or "energy"
      • Can train for energy, zenith, and/or track at the same time
        • parser arg option --variables 2 or 3
        • The order is fixed: energy, then zenith, then track
      • Starts at the given epoch, runs for the number of epochs specified
        • Helps to continue training model if killed (loads weights from given model)
        • Helps to kill and reload tensorflow to avoid memory leak
      • Can plot "test", comparing to oscnext flat test sample
    • Appends loss to saveloss_currentepoch.txt file in output directory
    • Look at make_jobs/run_CNN/ for slurm submission examples and make_jobs_condor/run_CNN/ for HTCondor examples
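
The file rotation, restart behaviour, and per-variable losses described for CNN_LoadMultipleFiles.py can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the script's actual code: the helper names (`epoch_schedule`, `losses_for`, `mape`, `mse`) are hypothetical, the modulo rule for choosing a file is assumed, and the NumPy loss functions only mirror the usual Keras definitions of mean_absolute_percentage_error and mean_squared_error.

```python
import numpy as np

# Fixed output order described above: energy, then zenith, then track.
VARIABLE_ORDER = ["energy", "zenith", "track"]


def mape(y_true, y_pred, eps=1e-7):
    """Mean absolute percentage error, following the usual Keras definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # eps guards against division by zero when a true value is 0.
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), eps))


def mse(y_true, y_pred):
    """Mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


# Loss assigned to each variable, as listed in the page.
LOSS_BY_VARIABLE = {"energy": mape, "zenith": mse, "track": mse}


def losses_for(variables, first_variable="energy"):
    """Return (name, loss) pairs for a --variables / --first_variable choice."""
    if variables == 1:
        names = [first_variable]            # "energy" or "zenith" alone
    else:
        names = VARIABLE_ORDER[:variables]  # fixed order for 2 or 3 variables
    return [(name, LOSS_BY_VARIABLE[name]) for name in names]


def epoch_schedule(files, start_epoch, num_epochs):
    """Pair each epoch with one input file, cycling through the file list.

    Assumed rule: epoch N uses files[N % len(files)], so a job restarted at a
    later start_epoch continues the same rotation.
    """
    return [(epoch, files[epoch % len(files)])
            for epoch in range(start_epoch, start_epoch + num_epochs)]
```

With three hypothetical files, `epoch_schedule(["file_0", "file_1", "file_2"], start_epoch=4, num_epochs=3)` pairs epochs 4, 5, and 6 with file_1, file_2, and file_0, which is why a killed job can be resubmitted from its last saved epoch without changing which file each epoch sees.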
  • plot_loss_from_column.py - plots loss from the column-sorted saveloss txt file

    • CNN_LoadMultipleFiles.py outputs the column-sorted saveloss txt file
    • File also stores the time to train per epoch, and the time for loading the data file plus training per epoch
    • Order of loss, validation loss, etc. varies with the number of variables being trained for (uses dict keys to pull the correct values)
    • Functionality:
      • Can set the y-axis limits with ymin and ymax parser args
      • Can specify which epoch to plot until, to shorten the x axis (parser arg)
      • Can manually change the number of files to average over and the epoch to start at
        • Set to 7 files to average over
        • Set to start plotting avg plots at epoch 49
    • Outputs plots to outdir folder
      • TrainingTimePerEpoch.png
      • loss_vs_epochs.png
      • AvgLossVsEpoch.png
      • AvgRangeVsEpoch.png
    • Look at make_jobs/plot_CNN/ for slurm submission examples and make_jobs_condor/plot_CNN/ for HTCondor examples
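
The averaged plots (AvgLossVsEpoch.png, AvgRangeVsEpoch.png) can be thought of along the lines of the sketch below. The page does not say whether the script uses block averages or a rolling window, or how "range" is defined, so this assumes non-overlapping blocks of 7 epochs (one per file) starting at epoch 49, with the range taken as max minus min within each block. The function name `average_over_files` and its parameters are hypothetical.

```python
import numpy as np


def average_over_files(loss, files_to_average=7, start_epoch=49):
    """Average per-epoch loss over blocks of `files_to_average` epochs.

    Assumptions (not confirmed by the script): blocks do not overlap, epochs
    before `start_epoch` are skipped, and a trailing partial block is dropped.
    """
    loss = np.asarray(loss, dtype=float)[start_epoch:]
    n_blocks = len(loss) // files_to_average
    # One row per block, one column per file in that pass.
    blocks = loss[: n_blocks * files_to_average].reshape(n_blocks, files_to_average)
    avg = blocks.mean(axis=1)
    spread = blocks.max(axis=1) - blocks.min(axis=1)
    return avg, spread
```

Averaging over one epoch per file smooths out file-to-file differences in the loss, which makes it easier to judge whether the curve is still improving.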
  • CNN_TestOnly.py - used for testing the CNN

    • Takes in one file
      • Use make_test_file.py to make multiple files into one testonly set
      • See Processing Scripts section of README for more information
    • Evaluates the network at a given model:
      • Parser arg for the directory name where the model is stored
      • Parser arg for the epoch number of the model to load
    • Can compare to old reco
      • Parser arg boolean flag --compare_reco
      • Give test name --test "PegLeg" or "Retro". Use "oscnext" for no comparison
    • Must load the same model architecture used in training (cnn_model.py)
    • Creates many plots and outputs to model directory, with subfolder that has the test name and epoch number (gives ability to perform multiple test types on multiple epoch stages)
    • Look at make_jobs/plot_CNN/ for slurm submission examples and make_jobs_condor/plot_CNN/ for HTCondor examples
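
As a rough illustration of how the CNN_TestOnly.py options fit together, the sketch below parses the flags and builds an output subfolder from the test name and epoch. Only `--test` and `--compare_reco` are named on this page; `--model_dir`, `--epoch`, the helper names, and the exact subfolder naming are assumptions made for the example and may not match the script.

```python
import argparse
import os


def build_parser():
    """Argument parser resembling the options described for CNN_TestOnly.py."""
    parser = argparse.ArgumentParser(description="Sketch of test-only options")
    # Hypothetical flag names for the model directory and epoch number.
    parser.add_argument("--model_dir", required=True,
                        help="directory where the trained model is stored")
    parser.add_argument("--epoch", type=int, required=True,
                        help="epoch number of the model to load")
    # Flags named in the page.
    parser.add_argument("--test", default="oscnext",
                        help='test name: "PegLeg", "Retro", or "oscnext"')
    parser.add_argument("--compare_reco", action="store_true",
                        help="compare against the old reconstruction")
    return parser


def output_dir(model_dir, test_name, epoch):
    """Subfolder holding plots for one test type at one epoch (naming assumed)."""
    return os.path.join(model_dir, f"{test_name}_{epoch}epochs")


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(output_dir(args.model_dir, args.test, args.epoch))
```

Keeping the test name and epoch in the subfolder name means several test types can be run on several epochs of the same model without overwriting each other's plots.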