
Correct normalization scheme; deprecate `batch_size`

@OverLordGoldDragon released this 13 Jul 18:56 · a99d833

Existing code normalized as: norm = sqrt(batch_size / total_iterations), where total_iterations = (number of fits per epoch) * (number of epochs in restart). However, fits per epoch = samples_per_epoch / batch_size, so total_iterations = total_samples / batch_size --> norm = batch_size * sqrt(1 / (samples_per_epoch * epochs)), making norm scale linearly with batch_size, which differs from the authors' sqrt scaling, λ = λ_norm * sqrt(b / (B*T)). Rewritten in iterations, the correct factor is sqrt(1 / total_iterations), so batch_size drops out of the formula entirely - hence its deprecation.
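
For illustration, a minimal sketch of the two schemes (function names and the toy dataset size are made up for this example, not the library's internals):

```python
import numpy as np

def old_norm(batch_size, samples_per_epoch, epochs):
    # Previous scheme: sqrt(batch_size / total_iterations),
    # with total_iterations counted in batches (fits).
    total_iterations = (samples_per_epoch // batch_size) * epochs
    return np.sqrt(batch_size / total_iterations)

def new_norm(batch_size, samples_per_epoch, epochs):
    # Authors' scheme: sqrt(b / (B*T)), equivalently sqrt(1 / total_iterations).
    return np.sqrt(batch_size / (samples_per_epoch * epochs))

for b in (16, 32, 64):
    print(b, old_norm(b, 3200, 10), new_norm(b, 3200, 10))
# old_norm doubles each time batch_size doubles (linear in b);
# new_norm grows by sqrt(2) per doubling, matching the paper.
```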

Users who never changed batch_size throughout training will be unaffected. (λ = λ_norm * sqrt(b / (B*T)); λ_norm is what we pick, our "guess". The point of normalization is that if our guess works well for batch_size=32, it'll also work well for batch_size=16 - but if batch_size is never changed, performance depends only on the guess, not on how it's normalized. A toy numeric check follows below.)
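
Toy numeric check of that transfer argument (B, T, and λ_norm values are made up for illustration):

```python
import numpy as np

B, T = 3200, 10       # samples per epoch and epochs - toy values
lam_norm = 1e-4       # the "guess", tuned at batch_size=32

for b in (32, 16):
    lam_paper = lam_norm * np.sqrt(b / (B * T))          # authors' normalization
    lam_old   = lam_norm * np.sqrt(b / ((B // b) * T))   # previous normalization
    print(b, lam_paper, lam_old)
# Authors' scheme: halving batch_size shrinks the effective λ by sqrt(2),
# so the same λ_norm remains a sensible guess.
# Old scheme: halving batch_size shrinks it by 2, so the guess no longer transfers.
```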

This is the release's main change, closing #52.

Updating existing code: for a choice of λ_norm that previously worked well, apply *= sqrt(batch_size). Ex (batch_size=32): `Dense(bias_regularizer=l2(1e-4))` --> `Dense(bias_regularizer=l2(1e-4 * sqrt(32)))`.
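
Put concretely, a sketch of the rescaling (assuming `tf.keras` layers/regularizers and a previous batch_size of 32; substitute your own values):

```python
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

batch_size = 32  # the batch size your old λ_norm was tuned with

# Before: a choice that worked well under the old normalization
layer_old = Dense(64, bias_regularizer=l2(1e-4))

# After upgrading: rescale λ_norm by sqrt(batch_size) to keep the same effective decay
layer_new = Dense(64, bias_regularizer=l2(1e-4 * np.sqrt(batch_size)))
```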