Releases: OverLordGoldDragon/keras-adamw
TF 2.3.1 compatibility
- Fixed `'L1' object has no attribute 'l2'` in TF 2.3.1 (and vice versa for non-`l1_l2` objects)
- Moved testing to TF 2.3.1
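The gist of the fix, as a minimal sketch (the helper and its name are illustrative, not the library's actual code): read `l1` / `l2` defensively so regularizers that define only one of the two attributes, as TF 2.3.1's dedicated `L1` / `L2` classes do, don't raise `AttributeError`.

```python
import tensorflow as tf

def get_l1_l2(regularizer):
    # Illustrative helper (not the library's actual implementation):
    # TF 2.3.1's L1 regularizer has no `l2` attribute (and vice versa),
    # so fall back to 0 instead of raising AttributeError.
    l1 = float(getattr(regularizer, 'l1', 0.))
    l2 = float(getattr(regularizer, 'l2', 0.))
    return l1, l2

print(get_l1_l2(tf.keras.regularizers.l1(1e-4)))  # (0.0001, 0.0)
```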
DOI
Adds a DOI for citation purposes
TF 2.3 compatibility
`control_dependencies` moved from `tensorflow.python.ops` to `tensorflow.python.framework.ops`; for backwards compatibility, edited code to use the public `tf.control_dependencies`.

Further, TF 2.3.0 isn't compatible with Keras 2.3.1 and earlier; unsure about later Keras versions, but development proceeds with `tf.keras`.
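For illustration, a minimal sketch of the public API the code now relies on (the variable and values are arbitrary):

```python
import tensorflow as tf

x = tf.Variable(1.)

@tf.function
def update_then_read():
    incr = x.assign_add(1.)
    # The public tf.control_dependencies is stable across TF versions,
    # unlike the private import path that moved in TF 2.3.
    with tf.control_dependencies([incr]):
        return tf.identity(x)

print(update_then_read())  # tf.Tensor(2.0, ...)
```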
Correct normalization scheme; deprecate `batch_size`
Existing code normalized as `norm = sqrt(batch_size / total_iterations)`, where `total_iterations` = (number of fits per epoch) * (number of epochs in a restart). However, `total_iterations = total_samples / batch_size` --> `norm = batch_size * sqrt(1 / (total_iterations_per_epoch * epochs))`, making `norm` scale linearly with `batch_size`, which differs from the authors' sqrt scaling.

Users who never changed `batch_size` throughout training are unaffected. (`λ = λ_norm * sqrt(b / BT)`; `λ_norm` is what we pick, our "guess". The idea of normalization is that if our guess works well for `batch_size=32`, it'll also work well for `batch_size=16` - but if `batch_size` is never changed, then performance is only affected by the guess.)

Main change here, closing #52.

Updating existing code: for a choice of `λ_norm` that previously worked well, apply `*= sqrt(batch_size)`. Ex: `Dense(bias_regularizer=l2(1e-4))` --> `Dense(bias_regularizer=l2(1e-4 * sqrt(32)))`.
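A minimal migration sketch, assuming `tf.keras` imports and a coefficient previously tuned at `batch_size=32` (both values are illustrative):

```python
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

old_coeff  = 1e-4  # λ_norm that worked well before this release (illustrative)
batch_size = 32    # batch size it was tuned with (illustrative)

# Under the corrected normalization, rescale the old coefficient by sqrt(batch_size)
layer = Dense(10, bias_regularizer=l2(old_coeff * np.sqrt(batch_size)))
```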
Add autorestart
FEATURE: `autorestart` option, which automatically handles Warm Restarts by resetting `t_cur=0` after `total_iterations` iterations.

- Defaults to `True` if `use_cosine_annealing=True`, else `False`
- Must use `use_cosine_annealing=True` if using `autorestart=True`

Updated README and `example.py`.
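A minimal usage sketch (the model, `lr`, and `total_iterations=24` are illustrative; the kwargs are those named in these notes):

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from keras_adamw import AdamW

ipt = Input((16,))
out = Dense(1, activation='sigmoid')(ipt)
model = Model(ipt, out)

# autorestart defaults to True whenever use_cosine_annealing=True,
# so t_cur is reset automatically after each `total_iterations` iterations.
opt = AdamW(lr=1e-3, model=model, use_cosine_annealing=True, total_iterations=24)
model.compile(opt, loss='binary_crossentropy')
```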
Synchronize updates; fix AdamW lr_t (keras)
BUGFIXES:

- Last weight in network would be updated with `t_cur` one update ahead, desynchronizing it from all other weights
- `AdamW` in `keras` (optimizers.py, optimizers_225.py) weight updates were not mediated by `eta_t`, so cosine annealing had no effect

FEATURES:

- Added `lr_t` to tf.keras optimizers to track the "actual" learning rate externally; use `K.eval(model.optimizer.lr_t)` to get the "actual" learning rate for a given `t_cur` and `iterations` (see the sketch at the end of this entry)
- Added `lr_t` vs. iterations plot to README, and source code in `example.py`

MISC:

- Added `test_updates` to ensure all weights update synchronously, and that `eta_t` first applies on weights as-is and then updates according to `t_cur`
- Fixes #47
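A minimal sketch of reading `lr_t` (tf.keras optimizers only, per the note above; the model and data are illustrative):

```python
import numpy as np
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from keras_adamw import AdamW

ipt = Input((4,))
model = Model(ipt, Dense(1)(ipt))
model.compile(AdamW(lr=1e-3, model=model, use_cosine_annealing=True,
                    total_iterations=10), loss='mse')

x, y = np.random.randn(32, 4), np.random.randn(32, 1)
model.train_on_batch(x, y)
print(K.eval(model.optimizer.lr_t))  # "actual" lr for current t_cur, iterations
```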
v1.31 Fix `SGDW(momentum=0)` case
BUGFIXES:

- `SGDW` with `momentum=0` would bug due to variable scoping issues; rewritten code is correct and should run a little faster. Files affected: `optimizers_v2.py`, `optimizers_225tf.py`

MISC:

- Added test case for `SGDW(momentum=0)`
- Added control test for `SGDW(momentum=0)` vs `SGD(momentum=0)`
- `tests/import_selection.py` -> `tests/backend.py`
- `test_optimizers.py` can now run as `__main__` without manually changing paths / working directories
v1.30 TF2.2 compatibility, usage changes, code cleanup
FEATURES:

- Compatibility with TF 2.2 (other versions still compatible, but no longer tested)
- `eta_t` now behaves deterministically, updating after `t_cur` (previously, behavior was semi-random)
- Lots of code cleanup
USAGE NOTES:

- `t_cur` should now be set to `-1` instead of `0` to reset `eta_t` to 0
- `t_cur` should now be set at `iters == total_iterations - 2`; explanation here
- `total_iterations` must now be `> 1`, instead of only `> 0`
- `total_iterations <= 1` will force `weight_decays` and `lr_multipliers` to `None`
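A manual-reset sketch of these notes (the callback and its name are my own illustration, including how the iteration counter is restarted; the `autorestart` option added in a later release above automates this):

```python
import tensorflow.keras.backend as K
from tensorflow.keras.callbacks import Callback

class TCurReset(Callback):
    """Illustrative callback: reset t_cur to -1 (not 0) so eta_t restarts at 0.
    `total_iterations` must match the value passed to the optimizer."""
    def __init__(self, total_iterations):
        super().__init__()
        self.total_iterations = total_iterations
        self._iters = 0

    def on_train_batch_end(self, batch, logs=None):
        self._iters += 1
        if self._iters == self.total_iterations - 2:  # per the note above
            K.set_value(self.model.optimizer.t_cur, -1)
            self._iters = 0  # start counting the next annealing cycle
```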
FIXES:

- Optimizers will no longer zero layer penalties if weight decays cannot be applied (i.e. `total_iterations` is not `> 1`)
- `eta_t` is now properly updated as a `tf.Variable`, instead of as an updated `tf.Tensor`
- Testing didn't actually include Eager in the last version - now it does
BREAKING:

- `utils_225tf.py` removed
- `utils_common.py` removed
- `optimizers_tfpy.py` removed
- `utils.py` code is now that of `utils_225tf.py`
- `utils_common.py` merged with `utils.py`
- `self.batch_size` is now an `int`, instead of a `tf.Variable`
MISC:

- `tests`: `/test_optimizers`, `/test_optimizers_225`, `/test_optimizers_225tf`, `test_optimizers_v2`, `test_optimizers_tfpy` removed
- All tests now done in a single file: `tests/test_optimizers.py`
- `_update_t_cur_eta_t` and `_update_t_cur_eta_t_apply_lr_mult` added to `utils.py`
- Updated `examples.py` and related parts in README
v1.23 Critical formula fix, performance improvement
BUGFIX:

- `l1` was being decayed as `l2`, and vice versa; formula now correct

FEATURES:

- Performance boost from including only nonzero decays (`l1`, `l2`) in calculations

MISC:

- Renamed funcs in `utils_common`, and removed unused kwarg in `get_weight_decays`
v1.21 Improve optimizer selection, input signature
FEATURES:

- `from keras_adamw import` now accounts for the TF 1.x + Keras 2.3.x case
- `model` and `zero_penalties` now show up in optimizer constructor input signatures, making them clearer and more Pythonic
- Each optimizer now has its own full docstring, instead of deferring to `help(AdamW)`

BREAKING:

- `model` is no longer to be passed as the first positional argument, but as a later one, or as a keyword argument (`model=model`)
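For illustration, the new calling convention (the tiny model exists only to make the snippet runnable; `zero_penalties=True` is written out just to show the keyword):

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from keras_adamw import AdamW

ipt = Input((8,))
model = Model(ipt, Dense(1)(ipt))

# Pre-1.21: AdamW(model, lr=1e-3)   # model as first positional argument
# 1.21+: pass `model` by keyword (or as a later positional argument)
opt = AdamW(lr=1e-3, model=model, zero_penalties=True)
```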
BUGFIXES:

- `name` defaults corrected; many were `"AdamW"` even if not AdamW - though no bugs were encountered as a result

MISC:

- `__init__` wrapper moved inside of `__init__` to avoid overriding the input signature