# RMSProp
RMSProp keeps an exponential running average of the squared gradients and divides the gradient by the square root of this running average when updating the weights, as shown in the equations below
$S_{d\theta} = \beta_2 S_{d\theta} +(1-\beta_2)d\theta^2$
$\theta = \theta - \alpha \frac{d\theta}{\sqrt{S_{d\theta} + \epsilon}}$
where,
$\alpha$ is the learning rate, $\beta_2$ is the decay factor for the running average, and $\epsilon$ is a small value added so the denominator doesn't become zero.
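A minimal NumPy sketch of a single RMSProp step, assuming `theta`, `grad`, and the running average `s` are arrays of the same shape; the function name and the default hyperparameter values are illustrative, not taken from the text.

```python
import numpy as np

def rmsprop_update(theta, grad, s, alpha=0.001, beta2=0.9, eps=1e-8):
    """One RMSProp step; returns updated parameters and running average."""
    # Exponential running average of the squared gradient
    s = beta2 * s + (1 - beta2) * grad**2
    # Scale the gradient by the square root of the running average
    theta = theta - alpha * grad / np.sqrt(s + eps)
    return theta, s
```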
## RProp
The intuition for why this works comes from the RProp method, where the magnitude of the gradient is ignored and only its sign is used.
$\theta = \theta - \alpha \frac{d\theta}{\lvert{d\theta}\rvert}$
If the previous gradient and the current gradient have the same sign, the learning rate is accelerated by multiplying it by a number greater than 1; otherwise we may have jumped over a local minimum, so the learning rate is reduced.
$\alpha = \begin{cases}
1.2\alpha & \text{if } \mathrm{sgn}(d\theta_{t-1}) = \mathrm{sgn}(d\theta_{t}) \\
0.5\alpha & \text{if } \mathrm{sgn}(d\theta_{t-1}) \ne \mathrm{sgn}(d\theta_{t})
\end{cases}$
$\alpha$ is also clipped to stay within maximum and minimum bounds.
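A minimal NumPy sketch of this sign-based rule, assuming a per-parameter step size `alpha`; the function name and the bound values `alpha_min` and `alpha_max` are illustrative assumptions, not taken from the text.

```python
import numpy as np

def rprop_update(theta, grad, prev_grad, alpha,
                 eta_plus=1.2, eta_minus=0.5,
                 alpha_min=1e-6, alpha_max=50.0):
    """One RProp step using only the sign of the gradient."""
    same_sign = np.sign(grad) == np.sign(prev_grad)
    # Grow the step size where the sign is unchanged, shrink it where it flipped
    alpha = np.where(same_sign, alpha * eta_plus, alpha * eta_minus)
    # Keep the step size within fixed bounds
    alpha = np.clip(alpha, alpha_min, alpha_max)
    # Update using only the sign of the gradient
    theta = theta - alpha * np.sign(grad)
    return theta, alpha
```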
The problem with this approach is that it doesn't work well for mini-batch training, although it can work with full-batch training. Because only the sign of the gradient is used, a gradient from a particularly bad mini-batch is weighted the same as one from a slightly bad mini-batch, so the noise is not averaged out across batches. RMSProp avoids this issue.
---
Related: [[Optimizers]]