Steps

Step 1: Check initial loss

Depending on how we initialize the parameters, we expect the initial loss to fall in a certain range. We can therefore check whether the initial loss is reasonable before training further

For example, in lecture 3, we expect the SVM loss with random weight initialization (so that all scores start near zero) to have $C - 1$ as its initial loss, where $C$ is the number of classes
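As a quick sanity check of that expectation, here is a minimal sketch. It assumes a margin of 1 and that the initial scores are all exactly zero, which approximates what small random weights produce. The softmax loss is included only for comparison; it is not part of the example above. The function names are illustrative.

```python
import math

def multiclass_svm_loss(scores, label, margin=1.0):
    """Hinge loss for one example: sum over wrong classes of max(0, s_j - s_y + margin)."""
    correct = scores[label]
    return sum(max(0.0, s - correct + margin)
               for j, s in enumerate(scores) if j != label)

def softmax_loss(scores, label):
    """Cross-entropy loss for one example, shifted by the max score for stability."""
    shift = max(scores)
    log_sum = math.log(sum(math.exp(s - shift) for s in scores))
    return -(scores[label] - shift - log_sum)

num_classes = 10
scores = [0.0] * num_classes  # assumption: near-zero scores at initialization

svm_initial = multiclass_svm_loss(scores, label=3)   # num_classes - 1
softmax_initial = softmax_loss(scores, label=3)      # log(num_classes)
```

If the loss measured on the first batch is far from these values, we should suspect the loss implementation or the initialization before suspecting anything else.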

Step 2: Overfit a small sample

Strategy

This is the debugging step: if our model can’t overfit a tiny sample, something is fundamentally broken

We try to reach 100% training accuracy on a small sample of the training data (~5-10 minibatches), experimenting with architecture, LR, and weight initialization until it works
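The idea can be sketched on a toy problem. The example below is an illustration only: the eight data points, the logistic regression model, and the hyperparameters are all made up, and a real model would replace them. What matters is the final check that training accuracy on the tiny sample reaches 100%.

```python
import math

# Hypothetical tiny sample: 8 linearly separable 2-D points with labels 0/1.
X = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -0.5), (-0.5, -1.5),
     (2.0, 1.0), (1.0, 2.0), (1.5, 0.5), (0.5, 1.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, steps=200):
    """Full-batch gradient descent on the cross-entropy loss, no regularization."""
    w, b = [0.0, 0.0], 0.0
    n = len(X)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for (x1, x2), t in zip(X, y):
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - t  # dL/dz for this example
            grad_w[0] += err * x1 / n
            grad_w[1] += err * x2 / n
            grad_b += err / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b

def accuracy(X, y, w, b):
    correct = sum((w[0] * x1 + w[1] * x2 + b > 0) == (t == 1)
                  for (x1, x2), t in zip(X, y))
    return correct / len(X)

w, b = train_logistic(X, y)
train_acc = accuracy(X, y, w, b)  # should be 1.0 if the training loop works
```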

Guideline

Loss not going down:

  1. LR too low
  2. bad initialization

Loss explodes to Inf or NaN:

  1. LR too high
  2. bad initialization

In either case, it may also be a bug in your code

Turn off regularization during this step
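The two symptoms in the guideline can be reproduced on a toy loss. This is a sketch under stated assumptions: the loss f(w) = w*w, the labels "exploded" and "stalled", and the `stall_fraction` threshold are all hypothetical choices for illustration, not a general-purpose diagnostic.

```python
import math

def diagnose_lr(lr, steps=1000, stall_fraction=0.99):
    """Run gradient descent on the toy loss f(w) = w*w and label the outcome."""
    w = 1.0
    initial_loss = w * w
    loss = initial_loss
    for _ in range(steps):
        w -= lr * 2.0 * w            # gradient of w*w is 2w
        loss = w * w
        if not math.isfinite(loss):  # loss became Inf or NaN
            return "exploded"
    if loss > initial_loss:          # finite, but larger than where we started
        return "exploded"
    if loss > stall_fraction * initial_loss:
        return "stalled"             # loss barely moved: LR may be too low
    return "ok"

outcomes = {lr: diagnose_lr(lr) for lr in (1.5, 1e-6, 0.1)}
```

On this toy loss, an LR of 1.5 doubles the magnitude of `w` at every step until the loss overflows, while an LR of 1e-6 leaves the loss almost unchanged after 1000 steps.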

Step 3: Find LR that makes loss go down

Now use the full training data instead of a small sample. We want to find an LR that makes the loss drop significantly within ~100 iterations

This is also a debugging step: it confirms that the architecture chosen in Step 2 works on the full dataset

Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
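A sweep over these candidates can be sketched as follows. The task here (fitting y = 3x with a single weight) is a made-up stand-in for the real model and data; only the candidate learning rates and the ~100 iteration budget come from the step above.

```python
# Hypothetical toy task: fit y = 3x with a single weight w.
xs = [0.2, 0.4, 0.6, 0.8, 1.0]
ys = [3.0 * x for x in xs]

def mse(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def loss_after(lr, iterations=100):
    """Train from w = 0 for a fixed number of iterations and return the final loss."""
    w = 0.0
    for _ in range(iterations):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return mse(w)

initial_loss = mse(0.0)
candidates = [1e-1, 1e-2, 1e-3, 1e-4]
final_losses = {lr: loss_after(lr) for lr in candidates}

# Keep the LR whose loss dropped the most within the iteration budget.
best_lr = min(final_losses, key=final_losses.get)
```

On a real model, comparing `final_losses[lr] / initial_loss` across candidates gives a rough sense of which learning rates make significant progress in the first ~100 iterations.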

Step 4: Coarse grid, train for ~1-5 epochs

Choose a few values of LR and weight decay around what worked in Step 3, and train a few models for ~1-5 epochs

Good weight decay values to try: 1e-4, 1e-5, 0
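The coarse grid can be sketched as a loop over every (LR, weight decay) pair. In this illustration, `train_and_validate` is a hypothetical stand-in that fits a one-weight toy model; in practice it would train the real model for a few epochs and return a validation metric. The learning rates are an assumed neighborhood of what worked in Step 3.

```python
import itertools

# Hypothetical toy task: fit y = 3x with a single weight w.
train_x = [0.2, 0.4, 0.6, 0.8, 1.0]
val_x = [0.3, 0.5, 0.7, 0.9]

def target(x):
    return 3.0 * x

def train_and_validate(lr, weight_decay, steps=50):
    """Full-batch gradient descent with L2 weight decay; returns validation loss."""
    w = 0.0
    n = len(train_x)
    for _ in range(steps):
        grad = sum(2 * (w * x - target(x)) * x for x in train_x) / n
        grad += 2 * weight_decay * w  # gradient of weight_decay * w**2
        w -= lr * grad
    return sum((w * x - target(x)) ** 2 for x in val_x) / len(val_x)

learning_rates = [1e-1, 1e-2, 1e-3]  # assumed neighborhood of the Step 3 result
weight_decays = [1e-4, 1e-5, 0.0]    # values suggested in the notes

results = {
    (lr, wd): train_and_validate(lr, wd)
    for lr, wd in itertools.product(learning_rates, weight_decays)
}
best_lr, best_wd = min(results, key=results.get)
```

Keeping the whole `results` dictionary, rather than only the best pair, makes it easier to choose the refined grid in the next step.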

Step 5: Refine grid, train longer

Pick the best models from Step 4 and train them for longer (~10-20 epochs) without learning rate decay

Step 6: Look at learning curves

We look at the learning curves, use the information they give about the training process to adjust the model, then go back to Step 5


Common Problem Detection

Learning Curves

Bad initialization

This may happen because the weights were initialized in a region where gradients are close to zero, so the loss barely moves at first

Loss plateaus

A plateau may indicate that the model repeatedly overshoots the minimum, so we may want to apply learning rate decay

Learning rate step decay

This suggests that we applied learning rate decay too early, while the loss was still going down
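For reference, a step decay schedule can be sketched as below. The drop factor of 0.1 and the 30-epoch interval are assumptions chosen for illustration; the right interval depends on when the loss actually plateaus, which is the point of the curve above.

```python
def step_decay(base_lr, epoch, drop_factor=0.1, epochs_per_drop=30):
    """Multiply the learning rate by drop_factor once every epochs_per_drop epochs."""
    return base_lr * drop_factor ** (epoch // epochs_per_drop)

# Learning rate for each of 90 epochs: 0.1 for epochs 0-29,
# about 0.01 for epochs 30-59, about 0.001 for epochs 60-89.
schedule = [step_decay(0.1, epoch) for epoch in range(90)]
```

If the loss was still decreasing at the epoch where the drop happens, increasing `epochs_per_drop` delays the decay.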

Accuracy Curve

Good curve, but it needs longer training

Overfitting: you may want more data or stronger regularization

Underfitting: you may want to train longer or use a bigger model
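These three cases can be turned into a rough heuristic on the train and validation accuracy histories. This is only a sketch: the thresholds `gap_large` and `gap_small`, the `window` size, and the returned labels are hypothetical, and reading the actual curves remains more reliable than any fixed rule.

```python
def diagnose_accuracy_curves(train_acc, val_acc,
                             gap_large=0.10, gap_small=0.02, window=3):
    """Label per-epoch accuracy histories using the train/val gap and the val trend."""
    gap = train_acc[-1] - val_acc[-1]
    val_trend = val_acc[-1] - val_acc[-window]  # change over the last few epochs
    if gap > gap_large and val_trend <= 0:
        return "overfitting"    # train keeps improving while val stalls or drops
    if gap < gap_small:
        return "underfitting"   # almost no gap between train and val
    if val_trend > 0:
        return "train longer"   # val accuracy is still improving
    return "unclear"

verdict = diagnose_accuracy_curves(
    train_acc=[0.60, 0.80, 0.95, 0.99],
    val_acc=[0.55, 0.60, 0.58, 0.57],
)
```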

Weight Update / Weight Magnitude

We want the ratio of the weight update magnitude to the weight magnitude to be around 0.001
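A minimal sketch of this check, assuming a plain SGD update (`-lr * grad`) and L2 norms for both magnitudes; the weights, gradients, and learning rate below are made-up numbers for illustration.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def update_ratio(weights, grads, lr):
    """Ratio of the update magnitude to the weight magnitude for vanilla SGD."""
    update = [-lr * g for g in grads]
    return l2_norm(update) / l2_norm(weights)

# Hypothetical values for one parameter tensor, flattened to a list.
weights = [0.5, -1.0, 2.0]
grads = [0.2, -0.1, 0.4]
ratio = update_ratio(weights, grads, lr=1e-2)  # about 0.002 for these numbers
```

A ratio far above 0.001 hints that the learning rate may be too high, and a ratio far below it hints that the learning rate may be too low.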