# On the importance of initialization and momentum in deep learning

Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton

Link: https://www.cs.toronto.edu/~fritz/absps/momentum.pdf

Related: [[Momentum]] [[Initialization]] [[Deep Learning]]

The authors find that the combination of good initialization and momentum works surprisingly well, matching the Hessian-Free optimization that seemed to produce the cutting-edge results at the time. Hessian-Free optimization has since fallen by the wayside because it is not very scalable and is complex to implement.

## Momentum

Previous studies concluded that momentum was not a big benefit. However, the authors found that those studies were using momentum for an "estimation" problem rather than an "optimization" problem. Estimation dominates the late stage of learning, but in deep learning problems the initial "transient" stage is usually the bottleneck, and that is where both momentum methods work well.

"Classic" momentum is

$v_{t+1} = \mu v_t - \epsilon \nabla f(\theta_t)$
$\theta_{t+1} = \theta_t + v_{t+1}$

Nesterov's momentum is

$v_{t+1} = \mu v_t - \epsilon \nabla f(\theta_t + \mu v_t)$
$\theta_{t+1} = \theta_t + v_{t+1}$

Classic momentum is a technique for accelerating gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations. While classic momentum computes the gradient at the current position $\theta_t$, Nesterov's method first applies the momentum step and then computes the gradient at the look-ahead position $\theta_t + \mu v_t$, where $\mu$ is the momentum coefficient. This difference allows Nesterov's method to be more responsive and stable. (A code sketch of both updates and the schedule is at the end of this note.)

The authors use a momentum schedule with a cap:

$\mu_t = \min\left(1 - 2^{-1-\log_2(\lfloor t/250 \rfloor + 1)},\ \mu_{\max}\right)$

![[Pasted image 20210102172944.png]]

They found that reducing the momentum coefficient to 0.9 for the last 1,000 iterations (out of 750,000 total) improves the solution, because of a shift from "optimization" to "estimation". They also found that keeping a high value of $\mu$ for most of training was better than using a lower value throughout. They think this is because a high $\mu$ carries the solution through plateaus where, even though good progress is being made, the error reduction isn't yet noticeable; eventually the solution reaches a new region, or a closer proximity to the local optimum.

## Initialization

The initialization technique used was "sparse initialization": each unit is connected to only a small, fixed number of randomly chosen units in the previous layer (15 in their experiments), with those weights drawn from a unit Gaussian distribution and the biases set to zero. Because the number of incoming connections is fixed, the initial total input to each unit does not depend on the size of the previous layer. When using tanh activations, they transform the weights by setting the biases to 0.5 and rescaling the weights by 0.25.

## Recurrent Neural Networks

I will come back to this section when I understand it better.
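## Code sketch

A minimal NumPy sketch of the two momentum updates and the schedule from the Momentum section, assuming a toy quadratic objective; the function names and hyperparameter values here are my own illustration, not the authors' code.

```python
import numpy as np


def momentum_schedule(t, mu_max=0.99):
    # mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max)
    # Starts at 0.5 and grows toward mu_max as t increases.
    return min(1.0 - 2.0 ** (-1.0 - np.log2(t // 250 + 1)), mu_max)


def grad(theta):
    # Stand-in gradient of a toy ill-conditioned quadratic f(theta) = 0.5 * theta^T A theta,
    # playing the role of the network's loss gradient.
    A = np.diag([1.0, 100.0])
    return A @ theta


def classical_momentum_step(theta, v, mu, eps):
    # v_{t+1} = mu * v_t - eps * grad(theta_t);  theta_{t+1} = theta_t + v_{t+1}
    v = mu * v - eps * grad(theta)
    return theta + v, v


def nesterov_momentum_step(theta, v, mu, eps):
    # Same update, but the gradient is evaluated at the look-ahead point theta_t + mu * v_t.
    v = mu * v - eps * grad(theta + mu * v)
    return theta + v, v


theta, v = np.array([1.0, 1.0]), np.zeros(2)
for t in range(2000):
    mu = momentum_schedule(t)
    theta, v = nesterov_momentum_step(theta, v, mu, eps=1e-3)
print(theta)  # drifts toward the minimum at the origin
```

Swapping `nesterov_momentum_step` for `classical_momentum_step` in the loop is the only change needed to compare the two updates.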