
Correct normalization scheme; deprecate `batch_size`

@OverLordGoldDragon released this 13 Jul 18:56 · a99d833

Existing code normalized as: norm = sqrt(batch_size / total_iterations), where total_iterations = (number of fits per epoch) * (number of epochs in restart). However, fits per epoch = samples_per_epoch / batch_size, so total_iterations = total_samples / batch_size --> norm = batch_size * sqrt(1 / (samples_per_epoch * epochs)), making norm scale linearly with batch_size, which differs from the authors' sqrt scaling, λ = λ_norm * sqrt(b / (B*T)). Rewritten in iterations, the correct factor is sqrt(1 / total_iterations), so batch_size drops out of the formula entirely - hence its deprecation.
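
For illustration, a minimal sketch of the two schemes (function names and the toy dataset size are made up for this example, not the library's internals):

```python
import numpy as np

def old_norm(batch_size, samples_per_epoch, epochs):
    # Previous scheme: sqrt(batch_size / total_iterations),
    # with total_iterations counted in batches (fits).
    total_iterations = (samples_per_epoch // batch_size) * epochs
    return np.sqrt(batch_size / total_iterations)

def new_norm(batch_size, samples_per_epoch, epochs):
    # Authors' scheme: sqrt(b / (B*T)), equivalently sqrt(1 / total_iterations).
    return np.sqrt(batch_size / (samples_per_epoch * epochs))

for b in (16, 32, 64):
    print(b, old_norm(b, 3200, 10), new_norm(b, 3200, 10))
# old_norm doubles each time batch_size doubles (linear in b);
# new_norm grows by sqrt(2) per doubling, matching the paper.
```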

Users who never changed batch_size throughout training will be unaffected. (λ = λ_norm * sqrt(b / (B*T)); λ_norm is what we pick, our "guess". The point of normalization is that if our guess works well for batch_size=32, it'll also work well for batch_size=16 - but if batch_size is never changed, performance depends only on the guess, not on how it's normalized. A toy numeric check follows below.)
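
Toy numeric check of that transfer argument (B, T, and λ_norm values are made up for illustration):

```python
import numpy as np

B, T = 3200, 10       # samples per epoch and epochs - toy values
lam_norm = 1e-4       # the "guess", tuned at batch_size=32

for b in (32, 16):
    lam_paper = lam_norm * np.sqrt(b / (B * T))          # authors' normalization
    lam_old   = lam_norm * np.sqrt(b / ((B // b) * T))   # previous normalization
    print(b, lam_paper, lam_old)
# Authors' scheme: halving batch_size shrinks the effective λ by sqrt(2),
# so the same λ_norm remains a sensible guess.
# Old scheme: halving batch_size shrinks it by 2, so the guess no longer transfers.
```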

This is the release's main change, closing #52.

Updating existing code: for a choice of λ_norm that previously worked well, apply *= sqrt(batch_size). Ex (batch_size=32): `Dense(bias_regularizer=l2(1e-4))` --> `Dense(bias_regularizer=l2(1e-4 * sqrt(32)))`.
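
Put concretely, a sketch of the rescaling (assuming `tf.keras` layers/regularizers and a previous batch_size of 32; substitute your own values):

```python
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

batch_size = 32  # the batch size your old λ_norm was tuned with

# Before: a choice that worked well under the old normalization
layer_old = Dense(64, bias_regularizer=l2(1e-4))

# After upgrading: rescale λ_norm by sqrt(batch_size) to keep the same effective decay
layer_new = Dense(64, bias_regularizer=l2(1e-4 * np.sqrt(batch_size)))
```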