# Deep Learning Initialization

Weights should be initialized to random values, *never* all zeros, and scaled so that each layer neither amplifies nor shrinks the signal passing through it (an effective per-layer factor close to 1). Biases can be initialized to zero or a small number like 0.01.

```python
import numpy as np

# One row per unit in the current layer, one column per unit in the previous layer.
W = np.random.randn(n_current_layer, n_prev_layer)
```

## 1. Vanishing and Exploding Gradients

Suppose you have a deep neural network in which all the weights are initialized to ~1.5 and all the activations are the identity. The prediction grows exponentially as you move from the input to the output layer, because the weight matrices multiply on top of each other. The same thing happens to the gradients during backpropagation; this is called the exploding gradient problem. The vanishing gradient problem is the mirror image: it appears when the weights are initialized to values smaller than 1, such as 0.5, so the repeated products shrink toward zero. To avoid both situations, the weights need to be scaled so that the per-layer factor stays as close to 1 as possible (a numeric sketch of this effect appears at the end of these notes). They cannot all be set to exactly 1, because identical weights never break symmetry and training won't progress.

## 2. Initialization Techniques

To counter the vanishing/exploding gradient problem, set the variance of the weights at a layer to $\frac{1}{n}$, where $n$ is the number of units in the previous (input) layer.

```python
# Scale by sqrt(1/n_prev_layer) so that Var(w) = 1 / n_prev_layer.
w = np.random.randn(n_current_layer, n_prev_layer) * np.sqrt(1 / n_prev_layer)
```

For ReLU activations, a variance of $\frac{2}{n}$ works better than $\frac{1}{n}$ (He et al. [^2]).

Multiplying by $\sqrt{\frac{1}{n_{l-1}}}$ is called "Xavier" initialization after Xavier Glorot [^3].

Multiplying by $\sqrt{\frac{2}{n_{l-1}+n_{l}}}$ is also used (Glorot & Bengio [^3]).

Sparse Initialization - set all weights to zero, but break symmetry by randomly connecting each neuron to a fixed number of neurons in the layer below, giving those few connections small random weights.
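
As a rough illustration of that last idea, here is a minimal sketch of sparse initialization; the fan-in of 10 connections and the 0.01 weight scale are illustrative choices, not values from these notes.

```python
import numpy as np

def sparse_init(n_prev_layer, n_current_layer, n_connections=10, scale=0.01):
    """Zero matrix in which each unit gets a fixed number of randomly chosen,
    small, non-zero incoming weights (this is what breaks symmetry)."""
    W = np.zeros((n_current_layer, n_prev_layer))
    for i in range(n_current_layer):
        idx = np.random.choice(n_prev_layer, size=min(n_connections, n_prev_layer),
                               replace=False)
        W[i, idx] = np.random.randn(len(idx)) * scale
    return W

W = sparse_init(256, 128)  # each of the 128 units connects to 10 of the 256 inputs
```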
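
The scaling rules from Section 2 can also be wrapped in small helper functions. This is a minimal sketch; the function names are mine, not standard API names.

```python
import numpy as np

def xavier_init(n_prev_layer, n_current_layer):
    """Var(w) = 1 / n_prev: 'Xavier' initialization."""
    return np.random.randn(n_current_layer, n_prev_layer) * np.sqrt(1.0 / n_prev_layer)

def he_init(n_prev_layer, n_current_layer):
    """Var(w) = 2 / n_prev: He et al., usually better with ReLU."""
    return np.random.randn(n_current_layer, n_prev_layer) * np.sqrt(2.0 / n_prev_layer)

def glorot_init(n_prev_layer, n_current_layer):
    """Var(w) = 2 / (n_prev + n_curr): Glorot & Bengio."""
    return np.random.randn(n_current_layer, n_prev_layer) * np.sqrt(
        2.0 / (n_prev_layer + n_current_layer))

# Example: a 128-unit ReLU layer fed by 256 inputs; biases start at zero.
W = he_init(256, 128)
b = np.zeros((128, 1))
```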
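
Finally, a small numeric experiment (my own illustration, not part of the original notes) makes the Section 1 discussion concrete: pushing an input through many linear layers explodes or vanishes unless the weight scale keeps the per-layer factor near 1, which is exactly what the $\sqrt{\frac{1}{n}}$ scaling achieves.

```python
import numpy as np

np.random.seed(0)
n_units, n_layers = 100, 50
x = np.random.randn(n_units, 1)

# Per-layer gain c: weights ~ N(0, (c / sqrt(n_units))^2), so the activation
# norm is multiplied by roughly c at every layer.
for c in (1.5, 0.5, 1.0):
    a = x
    for _ in range(n_layers):
        W = np.random.randn(n_units, n_units) * (c / np.sqrt(n_units))
        a = W @ a  # identity activation
    print(f"gain {c}: final activation norm = {np.linalg.norm(a):.2e}")
# gain 1.5 explodes, gain 0.5 vanishes, gain 1.0 (i.e. the sqrt(1/n) scale) stays stable.
```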