Hello @btahir,
I know my answer is late, but for anyone with the same question, here it is:
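In the original paper (see Algorithm 1), the update uses the bias-corrected estimate $\hat{m}_t = m_t / (1 - \beta_1^t)$ instead of $m_t$ (and likewise $\hat{v}_t = v_t / (1 - \beta_2^t)$ instead of $v_t$). Thus:

$$\frac{\hat{m}_t}{\sqrt{\hat{v}_t}} = \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot \frac{m_t}{\sqrt{v_t}} = \alpha_t \, \frac{m_t}{\sqrt{v_t}}$$

So `alpha_t` is the bias correction folded into a single scalar, not weight decay.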
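As a quick sanity check, here is a minimal sketch in plain Python (not the repository's `adam.py`) comparing the step from Algorithm 1 with the `alpha_t` form. The function names and the sample values are hypothetical, chosen only for illustration. With `eps = 0` the two agree up to floating-point rounding; with a nonzero `eps` they differ slightly, because the `alpha_t` form effectively rescales epsilon.

```python
import math


def step_bias_corrected(m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=0.0):
    """Step size as in Algorithm 1: bias-correct m and v first."""
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


def step_alpha(m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=0.0):
    """Step size with the correction folded into the scalar alpha_t."""
    alpha_t = math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return lr * alpha_t * m / (math.sqrt(v) + eps)


# Hypothetical raw moments after one step with gradient g = 0.5:
# m = (1 - beta1) * g, v = (1 - beta2) * g**2
m, v = 0.05, 0.00025
for t in (1, 10, 1000):
    a = step_bias_corrected(m, v, t)
    b = step_alpha(m, v, t)
    print(t, a, b, math.isclose(a, b, rel_tol=1e-9))
```

Computing one scalar `alpha_t` per step avoids dividing every element of `m_t` and `v_t` by the correction terms, which is presumably why the implementation is written that way.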
Thanks.
Awesome work! The adam.py implementation is very useful! I just had a quick question about:
alpha_t = tf.sqrt(1 - beta2_power) / (1 - beta1_power)
What does this achieve? Wouldn't the original Adam update be
var - lr * m_t / (tf.sqrt(v_t) + eps)
rather than
var - lr * (m_t*alpha_t) / (tf.sqrt(v_t) + eps)
Are you adding weight decay with alpha_t maybe?