Machine Learning: Hyperparameter tuning, Regularization and Optimization - Week 1 Notes

Posted by Tianle on September 20, 2017

Setting up your Machine Learning Application

Train/Dev/Test sets

  • For small datasets, a 70/30 or 60/20/20 split is typical; for large datasets (say 1 million examples), the dev and test sets just need to be big enough for reliable evaluation, so each can be as small as 1% or even 0.5% of the data (around 10,000 examples)
  • Make sure the dev and test sets come from the same distribution
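
A minimal NumPy sketch of such a split (the data shapes, the random seed, and the 0.5% sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 1,000,000 examples with 5 features each.
m = 1_000_000
X = rng.standard_normal((m, 5))
y = rng.integers(0, 2, size=m)

# Shuffle once, then carve off small dev/test sets -- ~0.5% each is plenty at this scale.
perm = rng.permutation(m)
X, y = X[perm], y[perm]

n_dev = n_test = 5_000  # 0.5% of 1,000,000
X_dev,   y_dev   = X[:n_dev],               y[:n_dev]
X_test,  y_test  = X[n_dev:n_dev + n_test], y[n_dev:n_dev + n_test]
X_train, y_train = X[n_dev + n_test:],      y[n_dev + n_test:]
```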

Bias/Variance

Assumption:

  • Human-level (approximately optimal) prediction error is nearly 0%
  • Training and dev sets are drawn from the same distribution

Result:

  • High Bias(underfitting): Training set error is HIGH; Dev set error is HIGH
  • High Variance(overfitting): Training set error is LOW; Dev set error is HIGH
  • High Bias & High Variance: Training set error is HIGH (e.g. 15%); Dev set error is MUCH HIGHER (e.g. 30%)
  • Low Bias & Low Variance: Training set error is LOW; Dev set error is LOW
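
A small helper that encodes the table above (the error thresholds are arbitrary assumptions, not rules from the course):

```python
def diagnose(train_err, dev_err, bayes_err=0.0, bias_tol=0.02, gap_tol=0.02):
    """Rough bias/variance diagnosis from train/dev error; thresholds are illustrative."""
    high_bias = (train_err - bayes_err) > bias_tol      # far from (near-)optimal error
    high_variance = (dev_err - train_err) > gap_tol     # large train/dev gap
    if high_bias and high_variance:
        return "high bias & high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "low bias & low variance"

print(diagnose(0.15, 0.30))  # high bias & high variance
print(diagnose(0.01, 0.11))  # high variance (overfitting)
```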

Basic Recipe for Machine Learning

```mermaid
graph TD
  A(Start) --> B{High Bias?}
  B --> |YES| C[Bigger Network]
  C --> D[Train Longer]
  D --> E["(NN Architecture Search)"]
  E --> B
  B --> |NO| F{High Variance?}
  F --> |YES| G[More Data]
  G --> H[Regularization]
  H --> I["(NN Architecture Search)"]
  I --> F
  F --> |NO| J(End)
```

Regularizing Your Neural Network

Regularization

  • L2 Regularization: \(||w||_2^2=\sum_{j=1}^{n_x}w_j^2=w^Tw\)
  • Frobenius Norm: \(||w^{[l]}||_F^2=\sum_{i=1}^{n^{[l-1]}}\sum_{j=1}^{n^{[l]}}(w_{ij}^{[l]})^2\)
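
A minimal sketch of how the Frobenius-norm penalty enters the cost and the gradients (the `parameters` dict with keys `W1, W2, ...` and the names `lambd`, `m` are assumptions about the surrounding code):

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, parameters, lambd, m):
    """J = cross-entropy + (lambd / 2m) * sum_l ||W^[l]||_F^2."""
    frob = sum(np.sum(np.square(parameters[k]))      # ||W^[l]||_F^2 for each layer
               for k in parameters if k.startswith("W"))
    return cross_entropy_cost + (lambd / (2 * m)) * frob

def l2_regularized_dW(dW, W, lambd, m):
    """Backprop picks up an extra (lambd / m) * W^[l] term ("weight decay")."""
    return dW + (lambd / m) * W
```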

Dropout Regularization

  • Use a lower keep-prob (i.e. a higher chance of dropping units) for large layers that are more likely to overfit, and a keep-prob close to 1 (little or no dropout) for small layers; see the sketch below
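
A minimal sketch of inverted dropout for one layer's activation matrix `A` (the value keep_prob = 0.8 is just an example):

```python
import numpy as np

def inverted_dropout(A, keep_prob=0.8):
    """Drop each unit with probability 1 - keep_prob, then rescale so the expected activation is unchanged."""
    D = np.random.rand(*A.shape) < keep_prob   # boolean dropout mask
    A = A * D                                  # zero out the dropped units
    A = A / keep_prob                          # "inverted" scaling: no extra work needed at test time
    return A, D                                # cache D to apply the same mask in backprop
```

At test time dropout is switched off entirely; the division by keep_prob during training is what keeps the activation scales consistent.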

Other Regularization Methods

  • Data Augmentation
  • Early Stopping
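
A hedged sketch of early stopping; the four callables are hypothetical hooks supplied by the caller, not part of the course code:

```python
def train_with_early_stopping(train_one_epoch, dev_error, get_weights, set_weights,
                              patience=5, max_epochs=100):
    """Stop once dev error has not improved for `patience` consecutive epochs."""
    best_err, best_weights, stale = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch()                     # one pass over the training set
        err = dev_error()                     # evaluate on the dev set
        if err < best_err:
            best_err, best_weights, stale = err, get_weights(), 0
        else:
            stale += 1
            if stale >= patience:
                break                         # dev error stopped improving
    set_weights(best_weights)                 # roll back to the best dev-set checkpoint
    return best_err
```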

Optimization Problem

Normalizing Inputs

  • Subtract Mean: \(\mu=\frac{1}{m}\sum_{i=1}^mx^{(i)}\), then \(x:=x-\mu\)

  • Normalize Variance: \(\sigma^2=\frac{1}{m}\sum_{i=1}^m(x^{(i)})^2\) (element-wise, computed after the mean has been subtracted), then \(x:=x/\sigma\)
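
In NumPy, assuming `X` has shape `(n_x, m)` with examples stored as columns (the course convention):

```python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """Zero-mean, unit-variance normalization, one feature (row) at a time."""
    mu = np.mean(X, axis=1, keepdims=True)             # mu = (1/m) * sum_i x^(i)
    X = X - mu                                         # subtract the mean
    sigma2 = np.mean(X ** 2, axis=1, keepdims=True)    # element-wise variance of the centered data
    X = X / np.sqrt(sigma2 + eps)                      # divide by the standard deviation
    return X, mu, sigma2
```

The same `mu` and `sigma2` computed on the training set should be reused to normalize the dev and test sets.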

Vanishing/Exploding Gradients

If the network is very deep, weights even slightly smaller than 1 cause activations and gradients to shrink exponentially with depth (vanish), while weights even slightly larger than 1 cause them to grow exponentially (explode); a quick illustration follows.
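
A numerical illustration of this exponential effect in a 100-layer linear "network" (the depth and the 0.9 / 1.1 scales are arbitrary choices):

```python
import numpy as np

x = np.ones(4)
for scale in (0.9, 1.1):
    W = scale * np.eye(4)        # each layer multiplies by a weight slightly below / above 1
    a = x.copy()
    for _ in range(100):         # 100 layers, no nonlinearity
        a = W @ a
    print(scale, a[0])           # 0.9 -> ~2.7e-5 (vanishes), 1.1 -> ~1.4e4 (explodes)
```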

Weight Initialization For Deep Networks

  • ReLU (He initialization): \(w^{[l]}=\text{np.random.randn(shape)}*\sqrt{\frac{2}{n^{[l-1]}}}\)

  • tanh (Xavier initialization): \(w^{[l]}=\text{np.random.randn(shape)}*\sqrt{\frac{1}{n^{[l-1]}}}\)
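
A sketch of both initializations (the `layer_dims` list such as `[n_x, n_1, ..., n_L]` is an assumed interface):

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu"):
    """He initialization (variance 2 / n^[l-1]) for ReLU, Xavier-style (1 / n^[l-1]) for tanh."""
    parameters = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]                                   # n^[l-1]
        scale = np.sqrt((2.0 if activation == "relu" else 1.0) / fan_in)
        parameters[f"W{l}"] = np.random.randn(layer_dims[l], fan_in) * scale
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return parameters
```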

Numerical Approximation of Gradients

Checking Your Derivative Computation

\(g(\theta)\approx\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}\)
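
A quick check on a toy function where the exact derivative is known (f(θ) = θ³, so f′(θ) = 3θ²; the function choice is just for illustration):

```python
def grad_approx(f, theta, eps=1e-7):
    """Two-sided (centered) difference: approximation error is O(eps^2), versus O(eps) for one-sided."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

print(grad_approx(lambda t: t ** 3, 1.0))   # ~3.0, matching the exact derivative 3 * 1.0**2
```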

Gradient Checking

Gradient Checking Implementation Notes

  • Don’t use in training - only to debug
  • If algorithm fails grad check, look at components to try to identify bug
  • Remember regularization
  • Doesn’t work with dropout
  • Run at random initialization; perhaps again after some training
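
A sketch of a full gradient check over a flattened parameter vector, using the usual relative-difference criterion (roughly 1e-7 is great, around 1e-3 suggests a bug); the flat-vector interface and the toy cost function are assumptions:

```python
import numpy as np

def gradient_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient `dtheta` of J at `theta` (flat vectors) to a numerical estimate."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    return (np.linalg.norm(dtheta_approx - dtheta)
            / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))

# Toy usage: J(theta) = sum(theta^2), so the exact gradient is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
print(gradient_check(lambda t: np.sum(t ** 2), theta, 2 * theta))   # tiny, e.g. ~1e-10
```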
