Releases: warner-benjamin/optimi
v0.2.1: param_groups_weight_decay
Adds param_groups_weight_decay, which excludes bias and normalization layers from weight decay. param_groups_weight_decay is lightly modified from PyTorch Image Models.
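A minimal usage sketch (API names follow the optimi documentation; the model, learning rate, and weight decay values here are placeholders):

```python
import torch
from torch import nn
from optimi import AdamW, param_groups_weight_decay

model = nn.Sequential(nn.Linear(128, 64), nn.LayerNorm(64), nn.Linear(64, 10))

# Build parameter groups: biases and normalization layers get weight_decay=0,
# all other parameters use the supplied weight decay value.
param_groups = param_groups_weight_decay(model, weight_decay=1e-2)

opt = AdamW(param_groups, lr=1e-3)
```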
Full Changelog: v0.2.0...v0.2.1
v0.2.0: Gradient Release & Optimizer Accumulation
- Add Gradient Release
- Add Optimizer Accumulation
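A rough sketch of how the two features fit together (API names follow the optimi documentation; the model, data, and hyperparameters are placeholders):

```python
import torch
from torch import nn
import torch.nn.functional as F
from optimi import AdamW, prepare_for_gradient_release, remove_gradient_release

model = nn.Linear(128, 10)

# gradient_release=True updates each parameter as soon as its gradient is
# computed during the backward pass, freeing that gradient immediately
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)
prepare_for_gradient_release(model, opt)

accumulation_steps = 4
for idx in range(16):  # stand-in for iterating over a dataloader
    x = torch.randn(32, 128)
    y = torch.randint(0, 10, (32,))

    # Optimizer accumulation: on non-final micro-batches, gradients are
    # accumulated into the optimizer states instead of performing a full step
    opt.optimizer_accumulation = (idx + 1) % accumulation_steps != 0

    # backward performs the (possibly accumulated) optimizer step;
    # explicit opt.step() and opt.zero_grad() calls are not required
    loss = F.cross_entropy(model(x), y)
    loss.backward()

# remove the gradient release hooks when training is finished
remove_gradient_release(model)
```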
Full Changelog: v0.1.2...v0.2.0
v0.1.2
Add RAdam and Ranger optimizers.
Full Changelog: v0.1.1...v0.1.2
v0.1.1: Initial Release
optimī
Fast, Modern, and Low Precision PyTorch Optimizers
optimi enables accurate low precision training via Kahan summation, supports fully decoupled weight decay, and features fast implementations of modern optimizers.
Low Precision Training with Kahan Summation
optimi optimizers can match the performance of mixed precision when training in BFloat16 by using Kahan summation.
Training in BFloat16 with Kahan summation can reduce non-activation training memory usage by 37.5 to 45.5 percent when using an Adam optimizer. BFloat16 training increases single GPU training speed by ~10 percent at the same batch size.
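As an illustration, a minimal BFloat16 training step might look like the following sketch (per the optimi docs, Kahan summation is enabled by default for low precision parameters; the kahan_sum argument is shown explicitly here, and the model and data are placeholders):

```python
import torch
from torch import nn
import torch.nn.functional as F
from optimi import AdamW

# create or cast the model to BFloat16
model = nn.Linear(128, 10, dtype=torch.bfloat16)

# kahan_sum defaults to on for low precision parameters; shown explicitly
opt = AdamW(model.parameters(), lr=1e-3, kahan_sum=True)

x = torch.randn(32, 128, dtype=torch.bfloat16)
y = torch.randint(0, 10, (32,))

# standard PyTorch training step, with no GradScaler or autocast required
loss = F.cross_entropy(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()
```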
Fully Decoupled Weight Decay
In addition to supporting PyTorch-style decoupled weight decay, optimi optimizers also support fully decoupled weight decay.
Fully decoupled weight decay decouples weight decay from the learning rate, more accurately following Decoupled Weight Decay Regularization. This can help simplify hyperparameter tuning as the optimal weight decay is no longer tied to the learning rate.
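A hedged sketch, assuming the decouple_lr flag from the optimi docs (note that fully decoupled weight decay typically uses a smaller weight decay value than PyTorch-style decoupled weight decay, since it is no longer multiplied by the learning rate):

```python
from torch import nn
from optimi import AdamW

model = nn.Linear(128, 10)

# decouple_lr=True enables fully decoupled weight decay, applying weight
# decay independently of the learning rate
opt = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5, decouple_lr=True)
```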
Foreach Implementations
All optimi optimizers have fast foreach implementations, which can significantly outperform the for-loop versions. optimi reuses the gradient buffer for temporary variables to reduce foreach memory usage.
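The implementation can also be selected explicitly via the foreach argument documented by optimi (a sketch; by default optimi chooses automatically based on the training device):

```python
from torch import nn
from optimi import AdamW

model = nn.Linear(128, 10)

# foreach defaults to None, letting optimi choose the implementation;
# set it explicitly to force the foreach or for-loop path
opt = AdamW(model.parameters(), lr=1e-3, foreach=True)
```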
Documentation
https://optimi.benjaminwarner.dev
Install
optimi is available to install from PyPI.
pip install torch-optimi
Optimizers
optimi implements the following optimizers: