# Adam Optimization

Adam optimization is a combination of [[RMSProp]] and [[Momentum]]. It is also [[Exponential Weighted Average#Bias correction|bias corrected]].

$V_{d\theta} = \beta_1 V_{d\theta} + (1-\beta_1)d\theta$
$S_{d\theta} = \beta_2 S_{d\theta} + (1-\beta_2)d\theta^2$
$V^{corrected}_{d\theta} = \frac{V_{d\theta}}{1-\beta_1^t} \qquad S^{corrected}_{d\theta} = \frac{S_{d\theta}}{1-\beta_2^t}$
$\theta = \theta - \alpha \frac{V^{corrected}_{d\theta}}{\sqrt{S^{corrected}_{d\theta}} + \epsilon}$

A NumPy sketch of this update (with AMSGrad as an option) is at the end of this note.

## AMSGrad

The intuition comes from cases where Adam failed to converge and plain momentum outperformed it. AMSGrad adds a single extra step just before the parameter update, so the second-moment estimate can never shrink:

$S_{d\theta}^t = \max(S_{d\theta}^{t-1}, S_{d\theta}^t)$

## AdamW

Adam with decoupled weight decay: the weight decay is applied directly to the parameters instead of being folded into the gradient.
https://openreview.net/pdf?id=ryQu7f-RZ

## Nadam

Instead of the classical momentum it uses Nesterov momentum. It is available as `torch.optim.NAdam` in PyTorch and `tf.keras.optimizers.Nadam` in Keras.

---
Related: [[Optimizers]]
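
Below is a minimal NumPy sketch of the update rules above, with AMSGrad as an optional flag. The function name `adam_step`, its argument names, and the toy quadratic example are illustrative choices, not a reference implementation.

```python
import numpy as np

def adam_step(theta, grad, v, s, s_max, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, amsgrad=False):
    """One Adam update for parameters `theta` given gradient `grad`.

    v, s   -- running first / second moment estimates (same shape as theta)
    s_max  -- running max of the corrected second moment (used when amsgrad=True)
    t      -- 1-based step counter, needed for bias correction
    """
    # Exponentially weighted averages of the gradient and its square
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2

    # Bias correction (compensates for v and s starting at zero)
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)

    if amsgrad:
        # AMSGrad: keep the largest second-moment estimate seen so far,
        # so the effective step size cannot grow back
        s_max = np.maximum(s_max, s_hat)
        s_hat = s_max

    theta = theta - lr * v_hat / (np.sqrt(s_hat) + eps)
    return theta, v, s, s_max


# Toy usage: minimise f(theta) = theta^2 (elementwise)
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
s = np.zeros_like(theta)
s_max = np.zeros_like(theta)
for t in range(1, 201):
    grad = 2 * theta  # gradient of theta^2
    theta, v, s, s_max = adam_step(theta, grad, v, s, s_max, t, lr=0.1, amsgrad=True)
print(theta)  # theta has been driven toward [0, 0]
```

Library implementations may differ in whether the AMSGrad max is taken before or after bias correction; the sketch follows the equation given above.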