Correct normalization scheme; deprecate `batch_size`
Existing code normalized as: `norm = sqrt(batch_size / total_iterations)`, where `total_iterations` = (number of fits per epoch) * (number of epochs in restart). However, `total_iterations = total_samples / batch_size` --> `norm = batch_size * sqrt(1 / total_samples) = batch_size * sqrt(1 / (samples_per_epoch * epochs))`, making `norm` scale linearly with `batch_size`, which differs from authors' sqrt scaling.
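A minimal sketch contrasting the two schemes (function names and sample counts below are illustrative, not the library's API):

```python
import numpy as np

def norm_old(batch_size, total_iterations):
    # previous (incorrect) scheme: scales linearly with batch_size,
    # since total_iterations itself shrinks as batch_size grows
    return np.sqrt(batch_size / total_iterations)

def norm_fixed(total_iterations):
    # corrected scheme: sqrt(1 / total_iterations) == sqrt(b / (B*T));
    # batch_size cancels out, which is why the argument can be deprecated
    return np.sqrt(1 / total_iterations)

total_samples = 51200  # samples_per_epoch * epochs (hypothetical)
for b in (16, 32, 64):
    iters = total_samples // b
    print(b, norm_old(b, iters), norm_fixed(iters))
# norm_old doubles as batch_size doubles; norm_fixed grows only by sqrt(2)
```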
Users who never changed `batch_size` throughout training will be unaffected. (λ = λ_norm * sqrt(b / (B*T)), with b = batch size, B = samples per epoch, T = epochs; λ_norm is what we pick, our "guess". The idea of normalization is to make it so that if our guess works well for `batch_size=32`, it'll work well for `batch_size=16` - but if `batch_size` is never changed, then performance is only affected by the guess.)
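For concreteness, a sketch of how a fixed guess transfers across batch sizes under the corrected scheme (all values hypothetical):

```python
from math import sqrt

def effective_decay(lambda_norm, batch_size, samples_per_epoch, epochs):
    # lambda = lambda_norm * sqrt(b / (B*T)); lambda_norm is the fixed "guess"
    return lambda_norm * sqrt(batch_size / (samples_per_epoch * epochs))

lambda_norm = 1e-2  # suppose this guess worked well at batch_size=32
for b in (16, 32):
    print(b, effective_decay(lambda_norm, b, samples_per_epoch=50000, epochs=10))
# the same guess yields a consistently rescaled decay at batch_size=16
```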
Main change here, closing #52.
Updating existing code: for a choice of λ_norm that previously worked well, apply `*= sqrt(batch_size)`. Ex: `Dense(bias_regularizer=l2(1e-4))` --> `Dense(bias_regularizer=l2(1e-4 * sqrt(32)))` for `batch_size=32`.
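A sketch of the migration, assuming TF-Keras imports (the `32` is whatever `batch_size` the old value was tuned at):

```python
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

batch_size = 32  # batch size used when the old lambda_norm was tuned

# before the fix:
old = Dense(10, bias_regularizer=l2(1e-4))
# after the fix: rescale the previously well-working value by sqrt(batch_size)
new = Dense(10, bias_regularizer=l2(1e-4 * np.sqrt(batch_size)))
```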