Introduction to ML Strategy
Why ML Strategy
Ideas
- Collect more data
- Collect a more diverse training set
- Train algorithm longer with gradient descent
- Try Adam instead of gradient descent
- Try bigger network
- Try smaller network
- Try dropout
- Add L2 regularization
- Network architecture
- Activation functions
- Number of hidden units
Orthogonalization
Chain of assumptions in ML
- Fit training set well on cost function (Bigger network, Adam)
- Fit dev set well on cost function (Regularization, bigger training set)
- Fit test set well on cost function (Bigger dev set)
- Performs well in real world (Change dev set or cost function)
Setting up your goal
Single number evaluation metric
Using a single number evaluation metric
F1 score = harmonic mean of Precision and Recall:
\[F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}\]
Use a dev set + a single real-number evaluation metric to speed up iteration.
When you track several numbers (e.g., error in each geography), their average can also serve as the single metric.
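A minimal Python sketch of the harmonic mean; the precision/recall numbers below are made up for illustration:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 / (1 / precision + 1 / recall)

# Two hypothetical classifiers: the harmonic mean penalizes an
# imbalance between P and R more than the plain average would.
print(f1_score(0.95, 0.90))  # ~0.924
print(f1_score(0.98, 0.85))  # ~0.910
```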
Satisficing and Optimizing metric
Accuracy: optimizing metric (maximize it)
Running time: satisficing metric (just has to be good enough, e.g., at most 100 ms)
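A sketch of how the two metric types combine when picking a model; the model stats below are hypothetical:

```python
models = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]

# Satisficing: filter to models that meet the running-time constraint.
feasible = [m for m in models if m["runtime_ms"] <= 100]
# Optimizing: among those, take the most accurate.
best = max(feasible, key=lambda m: m["accuracy"])
print(best["name"])  # "B"
```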
Train/dev/test distributions
Cat classification dev/test sets
Randomly shuffle the data into the dev and test sets so that both contain data from all regions
Guideline
Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on
Size of the dev and test sets
Old way of splitting data (smaller data sets)
- Train: 70%, Test: 30%
- Train: 60%, Dev: 20%, Test: 20%
New way (large data sets)
- Total: 1,000,000 examples -> Train: 98%, Dev: 1%, Test: 1%
Size of test set
Set your test set to be big enough to give high confidence in the overall performance of your system
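A sketch of both guidelines together: pool hypothetical per-region arrays, shuffle, then carve off 1% dev and 1% test sets from the same distribution. The region arrays are random placeholders standing in for real examples:

```python
import numpy as np

rng = np.random.default_rng(0)
us = rng.normal(size=(50_000, 10))
uk = rng.normal(size=(30_000, 10))
india = rng.normal(size=(20_000, 10))

data = np.concatenate([us, uk, india])
rng.shuffle(data)                       # dev/test now cover all regions

n = len(data)
n_dev = n_test = n // 100               # 1% each, 98% left for training
dev = data[:n_dev]
test = data[n_dev:n_dev + n_test]
train = data[n_dev + n_test:]
print(len(train), len(dev), len(test))  # 98000 1000 1000
```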
When to change dev/test sets and metrics
Cat dataset examples
Metric: classification error
- Algorithm A: 3% error, but lets through pornographic images
- Algorithm B: 5% error
The metric + dev set prefer A; you and your users prefer B. When that happens, the metric is no longer ranking algorithms correctly.
\[Error = \frac{1}{m_{dev}}\sum_{i=1}^{m_{dev}} w^{(i)}\, \mathcal{I}\{y_{pred}^{(i)} \neq y^{(i)}\}\]
\[w^{(i)} = \begin{cases} 1, & \text{if } x^{(i)} \text{ is non-porn} \\ 10, & \text{if } x^{(i)} \text{ is porn} \end{cases}\]
Orthogonalization for cat pictures: anti-porn
- So far we’ve only discussed how to define a metric to evaluate classifiers (place the target).
- Worry separately about how to do well on this metric (aim at it).
If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.
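A NumPy sketch of the weighted error defined above, where misclassified porn images count 10x; the arrays are toy placeholders:

```python
import numpy as np

def weighted_error(y_true, y_pred, is_porn):
    # Weight porn examples 10x, everything else 1x.
    w = np.where(is_porn, 10.0, 1.0)
    mistakes = (y_pred != y_true).astype(float)
    # 1/m_dev normalization, matching the formula above.
    return np.sum(w * mistakes) / len(y_true)

y_true  = np.array([1, 0, 1, 0, 1])
y_pred  = np.array([1, 1, 1, 0, 0])
is_porn = np.array([False, True, False, False, False])
print(weighted_error(y_true, y_pred, is_porn))  # (10 + 1) / 5 = 2.2
```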
Comparing to human-level performance
Why human-level performance?
A machine learning model's accuracy can surpass human accuracy, after which progress slows as it approaches the Bayes optimal error (the best possible error).
Humans are quite good at a lot of tasks. So long as ML is worse than humans, you can:
- Get labeled data from humans
- Gain insight from manual error analysis: Why did a person get this right?
- Better analysis of bias/variance
Avoidable bias
                  Scenario A          Scenario B
Humans (Bayes)    1%                  7.5%
Training error    8%                  8%
Dev error         10%                 10%
Result            Focus on bias       Focus on variance
Human-level error as a proxy for Bayes error
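A sketch of the decision rule implied by the table, using human-level error as the Bayes proxy; the numbers mirror the two scenarios above:

```python
def diagnose(human_err, train_err, dev_err):
    """Compare the two gaps and report which one to focus on."""
    avoidable_bias = train_err - human_err
    variance = dev_err - train_err
    return "bias" if avoidable_bias > variance else "variance"

print(diagnose(0.01, 0.08, 0.10))   # bias (7% avoidable bias vs 2% variance)
print(diagnose(0.075, 0.08, 0.10))  # variance (0.5% vs 2%)
```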
Understanding human-level performance
Human-level error as proxy for Bayes error
Medical image classification example. Suppose:
- (a) Typical human: 3% error
- (b) Typical doctor: 1% error
- (c) Experienced doctor: 0.7% error
- (d) Team of experienced doctors: 0.5% error
Define human-level error as 0.5%, the best performance achievable, since that gives the tightest proxy for Bayes error.
Summary of bias/variance with human-level performance
Human-level error (proxy for Bayes error)
      |
      |  "Avoidable bias"
      v
Training error
      |
      |  "Variance"
      v
Dev error
Surpassing human-level performance
                  Scenario A   Scenario B
Team of humans    0.5%         0.5%
One human         1%           1%
Training error    0.6%         0.3%
Dev error         0.8%         0.4%
In scenario B the training error is already below the best human-level proxy, so you can no longer tell how much avoidable bias remains.
Problems where ML significantly surpasses human-level performance
- Online advertising
- Product recommendations
- Logistics (predicting transit time)
- Loan approvals
These all learn from structured data; they are not natural perception tasks. ML has also matched or surpassed humans on some perception tasks:
- Speech recognition
- Some image recognition
- Medical
- ECG, skin cancer, …
Improving your model performance
The two fundamental assumptions of supervised learning
- You can fit the training set pretty well. ~ Avoidable bias
- The training set performance generalizes pretty well to the dev/test set. ~ Variance
Reducing (avoidable) bias and variance
Human-level
Human-level
      |
      |  Avoidable bias:
      |    - Train a bigger model
      |    - Train longer / use better optimization algorithms (Momentum, RMSprop, Adam)
      |    - NN architecture/hyperparameters search (RNN, CNN)
      v
Training error
      |
      |  Variance:
      |    - More data
      |    - Regularization (L2, dropout, data augmentation)
      |    - NN architecture/hyperparameters search
      v
Dev error
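As a concrete illustration, a minimal PyTorch sketch wiring up a few of these knobs: Adam on the bias side, dropout and an L2 penalty (via weight decay) on the variance side. The layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # dropout regularization (variance)
    nn.Linear(64, 1),
)
# Adam for faster optimization (bias side); weight_decay adds an
# L2 penalty on the weights (variance side).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```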