# Batch Normalization
Batch normalization works by "normalizing" the output of a layer (before or after the activation function):
$Z^{[i]}_{\text{norm}} = \gamma \frac{Z^{[i]}-\mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$
where $\mu$ and $\sigma^2$ are the mean and variance of $Z^{[i]}$ computed over the current mini-batch, and $\epsilon$ is a small constant for numerical stability. $\gamma$ and $\beta$ are learnable parameters that are updated using gradient descent. Note that this $\beta$ is different from the $\beta$ used in the notation for [[Momentum]] and [[RMSProp]].
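For concreteness, here is a minimal NumPy sketch of the forward pass of this formula (the function name, the toy shapes, and the $\epsilon$ value are just illustrative):

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Normalize the pre-activations Z over the mini-batch, then apply
    the learnable scale gamma and shift beta."""
    mu = Z.mean(axis=0)                      # per-unit mean over the batch
    var = Z.var(axis=0)                      # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * Z_norm + beta             # scale and shift

Z = np.random.randn(32, 4)                               # mini-batch of 32 examples, 4 units
out = batch_norm(Z, gamma=np.ones(4), beta=np.zeros(4))  # gamma=1, beta=0 leaves Z_norm unchanged
```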
A batch norm layer can be added before or after the activation. Andrew Ng says it is typically added before the activation.
Also note that when you add batch norm, the preceding layer's bias term is canceled out by the mean subtraction (and $\beta$ takes over its role), so you might as well create the layer without a bias term. This is easily done with the `bias=False` option in [[PyTorch]] or the `use_bias=False` option in [[Keras]].
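A sketch of the "batch norm before the activation, no bias" setup in PyTorch (the layer sizes here are arbitrary):

```python
import torch.nn as nn

# Linear -> BatchNorm -> activation. The Linear layer's bias is dropped
# because the mean subtraction would cancel it and beta takes over its role.
block = nn.Sequential(
    nn.Linear(128, 64, bias=False),  # no bias term
    nn.BatchNorm1d(64),              # normalize, then apply gamma/beta
    nn.ReLU(),                       # activation after batch norm
)
```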
Batch norm also has a slight regularization effect, since the mini-batch statistics add a little noise to each layer's activations. However, don't rely on it as your only regularizer; use techniques like weight decay as well.
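For example, in PyTorch weight decay can be set directly on the optimizer (the model and the `1e-2` value below are just placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64, bias=False), nn.BatchNorm1d(64), nn.ReLU())

# Combine batch norm with explicit weight decay rather than relying on
# batch norm's regularization effect alone.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```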
When a batch norm layer is used to predict on a single input, there is no mini-batch mean or standard deviation to compute, so estimates of the mean and variance collected during training (typically exponentially weighted running averages) are stored and used instead.
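A sketch of how this looks in PyTorch: switching to eval mode makes the layer use its stored running statistics instead of batch statistics (shapes are illustrative):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)

# Training: batch statistics are used and the running averages are updated.
bn.train()
_ = bn(torch.randn(32, 4))

# Inference on a single input: no batch to average over, so the stored
# running mean/variance from training are used instead.
bn.eval()
out = bn(torch.randn(1, 4))
print(bn.running_mean, bn.running_var)   # the stored statistics
```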
The technique was published by Sergey Ioffe and Christian Szegedy.[^1]
## Why does batch norm work?
Batch norm helps protect against [[Covariate Shift]] in the hidden layers, what the paper calls "internal covariate shift": as earlier layers' weights change during training, the distribution of a layer's inputs keeps shifting, and normalizing keeps that distribution more stable.
[^1]: Ioffe, S. & Szegedy, C., "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015). https://arxiv.org/pdf/1502.03167.pdf