# RMSProp

RMSProp keeps an exponential running average of the squared gradients and divides the gradient by the square root of this running average to update the weights, as shown in the equations below:

$S_{d\theta} = \beta_2 S_{d\theta} + (1-\beta_2)d\theta^2$

$\theta = \theta - \alpha \frac{d\theta}{\sqrt{S_{d\theta} + \epsilon}}$

where $\alpha$ is the learning rate, $\beta_2$ is the decay factor of the running average, and $\epsilon$ is a small value added so the denominator doesn't become zero. A minimal code sketch of this update is given at the end of this note.

## RProp

The intuition for why this works comes from the RProp method, where the magnitude of the gradient is ignored and only its sign is used:

$\theta = \theta - \alpha \frac{d\theta}{\lvert d\theta \rvert}$

If the previous gradient and the current gradient have the same sign, the learning rate is accelerated by multiplying it by a number greater than 1; otherwise we may have jumped over a local minimum, so the learning rate is reduced:

$\alpha = \begin{cases} 1.2\alpha & \text{if } \operatorname{sgn}(d\theta_{t-1}) = \operatorname{sgn}(d\theta_{t}) \\ 0.5\alpha & \text{if } \operatorname{sgn}(d\theta_{t-1}) \ne \operatorname{sgn}(d\theta_{t}) \end{cases}$

$\alpha$ is also clipped to stay within maximum and minimum bounds.

The problem with this approach is that it doesn't work well for mini-batch training; it only works with full-batch training. The reason is that, since only the sign of the gradient is used, the error from a particularly bad mini-batch is treated the same as the error from a slightly bad one, so gradient magnitudes never get averaged out across mini-batches. RMSProp avoids this issue because its running average retains the gradient magnitudes.

---

Related: [[Optimizers]]
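
## Code sketch

Below is a minimal NumPy sketch of the RMSProp update from the equations above, not taken from any particular library; the function name `rmsprop_update`, the hyperparameter values, and the toy quadratic objective are illustrative assumptions.

```python
import numpy as np

def rmsprop_update(theta, grad, s, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSProp step (illustrative sketch): update the running average of
    squared gradients, then scale the gradient by its square root."""
    s = beta2 * s + (1 - beta2) * grad ** 2          # S_dtheta update
    theta = theta - alpha * grad / np.sqrt(s + eps)  # epsilon keeps the denominator non-zero
    return theta, s

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
s = np.zeros_like(theta)
for _ in range(1000):
    grad = 2 * theta
    theta, s = rmsprop_update(theta, grad, s)
print(theta)  # should end up near 0
```

In the sketch, $\epsilon$ is placed inside the square root to match the equation above.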