# Momentum
Momentum is like adding an [[Exponential Weighted Average]] to the gradient term.
Andrew Ng's momentum equations are as follows:
$V_{d\theta} = \beta V_{d\theta} + (1-\beta)\,d\theta$
$\theta = \theta - \alpha \times V_{d\theta}$
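A minimal NumPy sketch of one such update step (the function name and the way the gradient is passed in are assumptions for illustration, not anyone's official implementation):

```python
import numpy as np

def ng_momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    """One Ng-style momentum step: v is an exponentially
    weighted average of the gradients."""
    v = beta * v + (1 - beta) * grad  # EWA of the gradient
    theta = theta - alpha * v         # move against the averaged gradient
    return theta, v

theta, v = np.array([1.0, -2.0]), np.zeros(2)
theta, v = ng_momentum_step(theta, v, grad=2 * theta)  # d(theta) of sum(theta**2)
```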
Momentum for deep learning was examined in depth in the paper by Sutskever et al. [[On the importance of initialization and momentum in deep learning|paper summary]] [^1]
"Classic" momentum is
$V_{d\theta} = \beta V_{d\theta} - \alpha \times d\theta$
$\theta = \theta + V_{d\theta}$
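The same step in the "classic" form, again as a sketch with hypothetical names: here the learning rate is folded into the velocity, which is then added to the parameters.

```python
def classic_momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    """'Classic' (heavy-ball) momentum step."""
    v = beta * v - alpha * grad  # velocity accumulates scaled gradients
    theta = theta + v            # step by the velocity
    return theta, v
```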
Nesterov's Momentum is
$V_{d\theta} = \beta V_{d\theta} - \alpha \times d(\theta + \beta V_{d\theta})$
$\theta = \theta + V_{d\theta}$
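Because Nesterov's version needs the gradient at the look-ahead point, a sketch of it takes a gradient *function* rather than a precomputed gradient (names illustrative, as above):

```python
def nesterov_momentum_step(theta, v, grad_fn, alpha=0.01, beta=0.9):
    """Nesterov momentum step: gradient is evaluated at the
    look-ahead point theta + beta * v, not at theta itself."""
    lookahead = theta + beta * v
    v = beta * v - alpha * grad_fn(lookahead)  # gradient at the look-ahead point
    theta = theta + v
    return theta, v
```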
PyTorch uses this method
$V_{d\theta} = \beta V_{d\theta} + d\theta$
$\theta = \theta - \alpha \times V_{d\theta}$
According to Andrew Ng, because this method drops the $(1-\beta)$ factor, $V_{d\theta}$ ends up scaled by roughly $1/(1-\beta)$, so changing $\beta$ also changes the effective step size and $\alpha$ has to be retuned.
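In practice this is just the `momentum` argument of `torch.optim.SGD`; a minimal usage sketch (the model and loss are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

loss = model(torch.randn(4, 10)).pow(2).mean()  # dummy loss
opt.zero_grad()
loss.backward()
opt.step()  # applies V = beta*V + d(theta), then theta -= alpha*V
```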
Keras uses this method
$V_{d\theta} = \beta V_{d\theta} - \alpha \times d\theta$
$\theta = \theta + V_{d\theta}$
When `nesterov=True` it becomes
$V_{d\theta} = \beta V_{d\theta} - \alpha \times d\theta$
$\theta = \theta + \beta \times V_{d\theta} - \alpha \times d\theta$
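In Keras both variants are exposed through `tf.keras.optimizers.SGD`; a minimal sketch:

```python
import tensorflow as tf

# "Classic" momentum
opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Nesterov momentum
opt_nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9,
                                       nesterov=True)
```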
$V_{d\theta}$ is also called the *velocity*. $d\theta$, the gradient, plays the role of an *acceleration*, and $\beta$ is the momentum factor, which acts like *friction*.
Momentum speeds up movement along directions where the gradients consistently point the same way, and damps oscillation along directions where they keep flipping sign.
Nesterov momentum evaluates the gradient at the look-ahead point $\theta + \beta V_{d\theta}$, which points the update in the right direction in case the "classic" momentum step overshoots.

https://stats.stackexchange.com/questions/179915/whats-the-difference-between-momentum-based-gradient-descent-and-nesterovs-acc
[^1]: Sutskever, I., Martens, J., Dahl, G., Hinton, G. "On the importance of initialization and momentum in deep learning." https://www.cs.toronto.edu/~fritz/absps/momentum.pdf