# Deep Learning Optimizers

Optimization in deep learning is driven by gradients computed with backward propagation. The method of pushing all training data through in one giant matrix at the same time (to take advantage of vectorization) is called "batch gradient descent", according to Andrew Ng. The method of breaking the training samples into separate "mini-batches" during training is called "mini-batch gradient descent". Using a single sample at a time is called [[backward propagation#Gradient Descent|stochastic gradient descent]].

I doubt these terms are universal, because the [[PyTorch]] and [[Keras]] optimizer implementations have no notion of batch size; the batch size is set on the data pipeline (the `DataLoader` in PyTorch, the `batch_size` argument to `fit` in Keras), while the update rule stays the same. It may be a moot point, since practically all training is done with mini-batches except for the simplest of cases. (A sketch comparing the three regimes is at the end of this note.)

To recap [[backward propagation]], the parameter update is

$$\theta = \theta - \alpha \, d\theta$$

where $\theta$ is the parameter being updated and $\alpha$ is the learning rate.

![[Momentum]]
![[RMSProp]]
![[Adam]]

## Learning Rate Decay

$$\alpha = \frac{1}{1 + \text{decayRate} \times \text{epochNumber}} \, \alpha_0$$

(See the decay sketch at the end of this note.)

Interesting Read: https://ruder.io/optimizing-gradient-descent/

---
Related [[Deep Learning]]
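
## Sketches

A minimal sketch of the three regimes described above, on a toy least-squares problem. The data, the `gradient` helper, and the hyperparameters are illustrative assumptions, not taken from any library; the point is that the update line is the same $\theta = \theta - \alpha \, d\theta$ in every case, and only the batch size changes.

```python
import numpy as np

# Toy data: fit theta so that X @ theta ≈ y (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=256)

def gradient(theta, X_batch, y_batch):
    # Gradient of the mean squared error on the given batch.
    return 2 * X_batch.T @ (X_batch @ theta - y_batch) / len(y_batch)

def train(batch_size, alpha=0.05, epochs=20):
    theta = np.zeros(3)
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)              # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            dtheta = gradient(theta, X[batch], y[batch])
            theta = theta - alpha * dtheta    # same update rule in every regime
    return theta

print(train(batch_size=256))  # batch gradient descent: all samples at once
print(train(batch_size=32))   # mini-batch gradient descent
print(train(batch_size=1))    # stochastic gradient descent: one sample at a time
```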
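
The learning-rate-decay schedule from above, written out as a small helper. The function name `decayed_lr` and the example values ($\alpha_0 = 0.1$, decay rate 1.0) are assumptions for illustration.

```python
def decayed_lr(alpha0, decay_rate, epoch):
    # alpha = alpha0 / (1 + decayRate * epochNumber), the schedule from the
    # Learning Rate Decay section above.
    return alpha0 / (1 + decay_rate * epoch)

for epoch in range(5):
    print(epoch, round(decayed_lr(alpha0=0.1, decay_rate=1.0, epoch=epoch), 4))
# 0 0.1, 1 0.05, 2 0.0333, 3 0.025, 4 0.02
```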