# Improving Deep Neural Networks
The course covers the practical fundamentals that any deep learning practitioner needs to know. Building successful deep learning networks is something of an art, but there are techniques that have proven successful and have enabled the field's huge growth over the past decade. The techniques described here are a starting point to build on for your specific application.
## Basic Recipe for Neural Networks
1. Train/dev/test split
2. Start with existing architecture where possible
3. Start with as much data as possible, ideally 10-100x the number of features/parameters
4. Always normalize
5. Always regularize
	1. Try to [[Augmentation|augment]] data with some transforms/jitter (see the sketch after this list)
6. Try to overfit
	1. High bias (underfitting) - more complex network
	2. High variance (overfitting) - get more data
7. Iterate
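The augmentation mentioned in step 5 can be as simple as random flips and a little jitter applied on the fly. A minimal NumPy sketch (the `augment_batch` helper and the batch shape are illustrative assumptions, not from the course):

```python
import numpy as np

def augment_batch(images, rng, noise_std=0.01):
    """Return jittered copies of an image batch of shape (N, H, W, C) in [0, 1]."""
    out = images.copy()
    # Randomly flip roughly half of the batch horizontally.
    flip = rng.random(len(out)) < 0.5
    out[flip] = out[flip, :, ::-1, :]
    # Add small Gaussian jitter and clip back into the valid range.
    out += rng.normal(0.0, noise_std, size=out.shape)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
batch = rng.random((32, 28, 28, 1))   # stand-in for real training images
augmented = augment_batch(batch, rng)
```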
## Before you Begin
### Do I have enough data?
The only way to tell whether you have enough data is to check whether you meet your accuracy goals. Even that isn't a sure thing, because you do not know whether you have covered the entire range you will encounter in production or whether the network has a bias. That said, if you are overfitting (high variance), getting more data usually helps. If you are underfitting (high bias), see if you can try a more complex network.
### Train/Dev/Test Split
With the large datasets typical of deep learning, it is common to use 98% of the data for training, 1% for dev, and 1% for test. The dev set is a cross-validation set used during training of the network; the test set is not touched until network building is complete. It may even be OK to have no test set at all. It is acceptable to include data from outside the intended population in the training set, but the dev and test sets should come from the same population as production.
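As a concrete illustration, here is one way a 98/1/1 split could look in NumPy (the `split_98_1_1` name and the array-based data are assumptions for this sketch):

```python
import numpy as np

def split_98_1_1(X, y, seed=0):
    """Shuffle and split arrays into ~98% train, ~1% dev, ~1% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_dev = n_test = max(1, len(X) // 100)
    n_train = len(X) - n_dev - n_test
    train, dev, test = np.split(idx, [n_train, n_train + n_dev])
    return (X[train], y[train]), (X[dev], y[dev]), (X[test], y[test])
```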
### Bias/Variance
A model that is overfitting is said to have high variance; a model that is underfitting is said to have high bias. The following table shows several scenarios and their diagnoses.
| Error | Scenario 1 | Scenario 2 | Scenario 3 |
| --------------------- | ------------- | ---------- | ---------- |
| Train Set Error | 1% | 15% | 15% |
| Dev Set Error | 11% | 16% | 16% |
| Optimal (Bayes) Error | 0% | 0% | 14% |
| Diagnosis | High Variance | High Bias | Good model |
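The same logic can be written as a small diagnostic helper. The `diagnose` name and the thresholds below are illustrative assumptions, not values from the course:

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.02):
    """Rough bias/variance diagnosis mirroring the table above."""
    avoidable_bias = train_err - bayes_err   # gap to the best achievable error
    variance = dev_err - train_err           # gap between train and dev error
    if avoidable_bias > tol:
        return "high bias: try a bigger / more complex network"
    if variance > tol:
        return "high variance: get more data or regularize"
    return "looks reasonable"

print(diagnose(0.01, 0.11))                   # Scenario 1 -> high variance
print(diagnose(0.15, 0.16))                   # Scenario 2 -> high bias
print(diagnose(0.15, 0.16, bayes_err=0.14))   # Scenario 3 -> looks reasonable
```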
### Normalization
It is important to normalize the data so that the mean is ~0 and the range is roughly (-1, 1). Having one feature in the hundreds and another in single digits is a recipe for disaster: the cost surface becomes elongated and gradient descent converges slowly.
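A minimal sketch, assuming plain NumPy feature matrices; the key point is to compute the statistics on the training set only and reuse them for the dev/test sets:

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Per-feature mean and standard deviation from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps   # eps avoids division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply the training-set statistics so every split is scaled identically."""
    return (X - mu) / sigma
```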
![[Regularization]]
![[Initialization]]
## Gradient Checking
If you are implementing your own gradient calculations for backprop, you might want to double-check that the gradients are correct. This is a debugging step and should not be done during training. Ng shows a simple technique based on comparing the analytic gradients against a numerical approximation. However, since most frameworks do the backprop automatically, this is not usually needed.
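A minimal sketch of such a check using centered finite differences, assuming the loss and its gradient are available as functions of a flat parameter vector (the names here are illustrative):

```python
import numpy as np

def gradient_check(f, grad_f, theta, eps=1e-7):
    """Compare an analytic gradient to a centered finite-difference estimate."""
    analytic = grad_f(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta)
        bump[i] = eps
        numeric[i] = (f(theta + bump) - f(theta - bump)) / (2 * eps)
    # Relative difference: roughly < 1e-7 is good, > 1e-3 usually means a bug.
    return np.linalg.norm(analytic - numeric) / (
        np.linalg.norm(analytic) + np.linalg.norm(numeric))

# Sanity check on f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.random.default_rng(0).normal(size=5)
print(gradient_check(lambda t: float(t @ t), lambda t: 2 * t, theta))
```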
![[Optimizers]]
![[Batch Normalization]]
![[Softmax]]
## References:
[^1] Improving neural networks by preventing co-adaptation of feature detectors, Hinton et al. https://arxiv.org/pdf/1207.0580.pdf #📖
[^2] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, He et al. https://arxiv.org/pdf/1502.01852.pdf
[^3] Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
[^4] Efficient BackProp, LeCun et al. https://cseweb.ucsd.edu/classes/wi08/cse253/Handouts/lecun-98b.pdf
---
Related: [[Deep Learning]]