# Regularization

Regularization is the process of reducing overfitting by constraining the network. One way of doing this is by keeping the weights from becoming too large. Regularization is not usually applied to biases.

Here are some techniques for regularization. Rough PyTorch sketches for several of them are collected at the end of this note.

## L2 Regularization aka weight decay

This is done by adding a "penalty" term to the loss function as follows:

$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}\lVert{w^{[l]}}\rVert_F^2$

Note that $\lVert{w}\rVert_F^2$ is the **square** of the *Frobenius Norm*, where the Frobenius Norm of a layer's weight matrix is

$\lVert{w^{[l]}}\rVert_F = \sqrt{\sum_{i=1}^{n_l}\sum_{j=1}^{n_{l-1}}(w_{ij}^{[l]})^2}$

$\lambda$ is a regularization hyperparameter (a number like 0.01). The larger the value, the stronger the regularization effect.

L1 Regularization is also an option, although it doesn't seem to be the default on most optimizers in [[Pytorch]].

## Dropout Regularization

Dropout regularization is the method of randomly eliminating nodes based on a probability parameter while training the network. A key thing to keep in mind is that with dropout enabled, the cost function may not descend as smoothly as it normally does. Try training without dropout before using dropout.

[[Research Paper]] http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf #📖

### Inverted Dropout

Inverted dropout is just the term Hinton et al. used[^1]. Instead of specifying the dropout probability $p$, they use the "keep probability" $q = 1-p$. The important point is that the outputs of a layer are scaled by a factor of $\frac{1}{q}$ during training. At test time, no dropout is used.

## Early Stopping

Early stopping means going back through the epochs and taking the parameters from the point where training loss was still dropping but test loss started increasing.

## Data Augmentation

Data Augmentation has a regularizing effect because it forces the network not to focus too much on a specific feature. Some form of jitter/transformation is usually added to the data when augmenting.

## Max Norm

A form of regularization that works by enforcing an upper bound on the L2 norm of the weight vector at each neuron. Typical values of the max norm are on the order of 3 or 4. It keeps the network from exploding even with a high learning rate.
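## Code Sketches (PyTorch)

For L2 / weight decay, a minimal sketch of how the penalty is usually passed to a PyTorch optimizer via its `weight_decay` argument; the model, learning rate, and $\lambda = 0.01$ are placeholder choices, and the parameters are split into groups only to show how biases can be left unregularized, as suggested above.

```python
import torch
import torch.nn as nn

# Placeholder two-layer network, just to have some weights to decay.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# weight_decay is PyTorch's name for the L2 penalty coefficient (lambda above).
# By default it would hit every parameter, so biases get their own group with
# weight_decay=0 to match "regularization is not usually applied to biases".
optimizer = torch.optim.SGD(
    [
        {"params": [p for n, p in model.named_parameters() if n.endswith("bias")],
         "weight_decay": 0.0},
        {"params": [p for n, p in model.named_parameters() if not n.endswith("bias")],
         "weight_decay": 0.01},
    ],
    lr=0.1,
)
```

L1 regularization would typically be added to the loss by hand instead, e.g. something like `loss + 0.01 * sum(p.abs().sum() for p in weights)` before calling `backward()`.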
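For inverted dropout, a from-scratch sketch on a single activation tensor, keeping each unit with probability $q$ and scaling by $1/q$ during training only; the function name is my own. PyTorch's built-in `nn.Dropout(p)` does the equivalent $1/(1-p)$ scaling in training mode and is a no-op in `eval()` mode.

```python
import torch

def inverted_dropout(a: torch.Tensor, q: float, training: bool = True) -> torch.Tensor:
    """Inverted dropout with keep probability q (= 1 - p)."""
    if not training:
        return a  # no dropout and no scaling at test time
    mask = (torch.rand_like(a) < q).float()  # keep each unit with probability q
    return a * mask / q  # scale by 1/q so the expected activation is unchanged
```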
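For early stopping, a sketch of a training loop that remembers the best validation result and "goes back" to it; `train_one_epoch`, `evaluate`, and the patience value are hypothetical stand-ins.

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` epochs,
    then restore the weights from the best epoch."""
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)      # hypothetical: one pass over the training set
        val_loss = evaluate(model)  # hypothetical: loss on the held-out set
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)  # "go back" to the best epoch
    return model
```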
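For data augmentation, one example of the kind of jitter/transformation mentioned above, using torchvision transforms on images; the particular transforms and their parameters are arbitrary choices.

```python
from torchvision import transforms

# Random transformations applied on the fly, so the network rarely sees
# exactly the same input twice.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```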
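For max norm, a sketch of clipping each neuron's incoming-weight vector using `torch.renorm`; treating each row of an `nn.Linear` weight matrix as one neuron's weight vector is the assumption here, and the bound of 3 just follows the "order of 3 or 4" note above.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def apply_max_norm(module: nn.Module, max_norm: float = 3.0) -> None:
    """Clip each neuron's incoming-weight vector to an L2 norm of at most max_norm."""
    for name, param in module.named_parameters():
        if name.endswith("weight") and param.dim() == 2:
            # For nn.Linear, weight has shape (out_features, in_features), so each
            # row is one neuron's weight vector; renorm along dim 0 caps each row.
            param.copy_(torch.renorm(param, p=2, dim=0, maxnorm=max_norm))
```

Typically this would be called right after `optimizer.step()` on each iteration.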