Setting up your Machine Learning Application
Train/Dev/Test sets
- For small datasets, a 70/30 or 60/20/20 split is typical; for large datasets (say 1 million examples), the dev and test sets just need to be big enough for reliable evaluation, so 1% or even 0.5% (around 10,000 examples each) can suffice
- Make sure the dev and test sets come from the same distribution
Bias/Variance
Assumption:
- Human-level (optimal) error is nearly 0%
- Training and dev sets are drawn from the same distribution
Result:
- High Bias(underfitting): Training set error is HIGH; Dev set error is HIGH
- High Variance(overfitting): Training set error is LOW; Dev set error is HIGH
- High Bias & High Variance: Training set error is HIGH; Dev set error is MUCH HIGHER (e.g. 15% train / 30% dev)
- Low Bias & Low Variance: Training set error is LOW; Dev set error is LOW
Basic Recipe for Machine Learning
graph TD
A(Start) --> B{High Bias?}
B --> |YES|C[Bigger Network]
C --> D[Train Longer]
D --> E["(NN Architecture Search)"]
E --> B
B --> |NO|F{High Variance?}
F --> |YES|G[More Data]
G --> H[Regularization]
H --> I["(NN Architecture Search)"]
I --> F
F --> |NO|J(End)
Regularizing Your Neural Network
Regularization
- L2 Regularization: \(||w||_2^2=\sum_{j=1}^{n_x}w_j^2=w^Tw\)
- Frobenius Norm: \(||w^{[l]}||_F^2=\sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(w_{ij}^{[l]})^2\), where \(w^{[l]}\) has shape \((n^{[l]}, n^{[l-1]})\) (see the sketch below)
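A minimal NumPy sketch of how the Frobenius-norm penalty is typically added to the cost and to the gradients (names such as `l2_regularized_cost` and `lambd` are illustrative, not from the course code):

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Add the L2 (Frobenius-norm) penalty to an unregularized cost.
    weights: list of W^[l] matrices, lambd: regularization strength, m: number of examples."""
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty

# In backprop the penalty adds (lambd / m) * W^[l] to each dW^[l]:
#   dW = dW_from_backprop + (lambd / m) * W
```

The extra \(\frac{\lambda}{m}w^{[l]}\) term in each \(dW^{[l]}\) is why L2 regularization is also called weight decay.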
Dropout Regularization
- You can use a higher dropout probability (i.e. a lower keep probability) for large layers that are more prone to overfitting, and little or no dropout for small layers (see the sketch below)
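A minimal sketch of inverted dropout for one layer's activations, assuming NumPy and an illustrative per-layer `keep_prob`:

```python
import numpy as np

def inverted_dropout_forward(a, keep_prob):
    """Apply inverted dropout to one layer's activations `a` (training time only)."""
    d = np.random.rand(*a.shape) < keep_prob   # boolean mask: keep a unit with prob keep_prob
    a = a * d                                  # drop the masked-out units
    a = a / keep_prob                          # scale up so the expected activation is unchanged
    return a, d                                # the mask d is reused in backprop

# At test time no dropout (and no scaling) is applied; the division by keep_prob
# above keeps the expected values consistent between training and test.
```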
Other Regularization Methods
- Data Augmentation
- Early Stopping
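A minimal early-stopping loop, sketched with placeholder callables `train_one_epoch` and `dev_error` (both assumed, not part of the course code):

```python
import copy

def train_with_early_stopping(params, train_one_epoch, dev_error, max_epochs=100, patience=5):
    """Stop once the dev-set error has not improved for `patience` consecutive epochs.
    train_one_epoch(params) -> params and dev_error(params) -> float are user-supplied."""
    best_err, best_params, bad_epochs = float("inf"), copy.deepcopy(params), 0
    for _ in range(max_epochs):
        params = train_one_epoch(params)
        err = dev_error(params)
        if err < best_err:
            best_err, best_params, bad_epochs = err, copy.deepcopy(params), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_params  # weights from the epoch with the lowest dev error
```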
Optimization Problem
Normalizing Inputs
- Subtract Mean: \(\mu=\frac{1}{m}\sum_{i=1}^mx^{(i)}\), then \(x:=x-\mu\)
- Normalize Variance: \(\sigma^2=\frac{1}{m}\sum_{i=1}^m(x^{(i)})^2\) (element-wise, after subtracting the mean), then \(x:=x/\sigma\)
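A small NumPy sketch of this normalization (`normalize_inputs` is an illustrative name); the same `mu` and `sigma`, computed on the training set, should be reused for the dev and test sets:

```python
import numpy as np

def normalize_inputs(X):
    """Zero-center and scale features. X has shape (n_x, m): one column per example."""
    mu = np.mean(X, axis=1, keepdims=True)      # per-feature mean
    X = X - mu                                  # subtract mean
    sigma = np.std(X, axis=1, keepdims=True)    # per-feature standard deviation
    X = X / (sigma + 1e-8)                      # divide by sigma (epsilon avoids division by zero)
    return X, mu, sigma
```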
Vanishing/Exploding Gradients
If the NN is very deep, the effect of the weights compounds across layers: activations and gradients shrink exponentially (vanish) when the weights are slightly smaller than 1, and grow exponentially (explode) when they are slightly larger than 1
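A toy numeric illustration of the compounding effect (plain Python, with scalar "weights" standing in for layer matrices):

```python
# Composing 150 layers whose weights are slightly below or above 1
# shrinks or blows up the signal exponentially.
print(0.9 ** 150)   # ~1.4e-07  -> vanishing
print(1.1 ** 150)   # ~1.6e+06  -> exploding
```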
Weight Initialization For Deep Networks
- ReLU (He initialization): \(w^{[l]}=np.random.randn(shape)*np.sqrt(\frac{2}{n^{[l-1]}})\)
- Tanh (Xavier initialization): \(w^{[l]}=np.random.randn(shape)*np.sqrt(\frac{1}{n^{[l-1]}})\)
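A NumPy sketch of these initializations; `initialize_weights` and `layer_dims` are illustrative names:

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu"):
    """He initialization for ReLU, Xavier-style for tanh.
    layer_dims = [n_x, n_1, ..., n_L]."""
    params = {}
    for l in range(1, len(layer_dims)):
        scale = 2.0 if activation == "relu" else 1.0
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(scale / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```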
Numerical Approximation of Gradients
Checking Your Derivative Computation
\(g(\theta)\approx\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}\)
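A quick check of the two-sided difference on a toy function (illustrative only); the two-sided estimate has error on the order of \(\epsilon^2\), versus \(\epsilon\) for a one-sided difference:

```python
def f(theta):
    return theta ** 3          # f'(theta) = 3 * theta^2

theta, eps = 1.0, 1e-2
approx = (f(theta + eps) - f(theta - eps)) / (2 * eps)
print(approx)                  # ~3.0001, very close to the true derivative 3
```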
Gradient Checking
Gradient Checking Implementation Notes
- Don’t use in training - only to debug
- If algorithm fails grad check, look at components to try to identify bug
- Remember regularization
- Doesn’t work with dropout
- Run at random initialization; perhaps again after some training
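A minimal gradient-checking sketch over a flattened parameter vector; `cost_fn`, `theta`, and `grad` are assumed inputs standing for the cost function, the parameters, and the analytic gradient from backprop:

```python
import numpy as np

def grad_check(cost_fn, theta, grad, eps=1e-7):
    """Compare the analytic gradient `grad` with a numerical estimate at `theta`.
    cost_fn: callable returning the scalar cost J(theta); theta, grad: 1-D float arrays."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        grad_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * eps)
    # Relative difference: ~1e-7 is great, ~1e-5 is borderline, ~1e-3 suggests a bug.
    diff = (np.linalg.norm(grad - grad_approx)
            / (np.linalg.norm(grad) + np.linalg.norm(grad_approx)))
    return diff
```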