Stochastic Gradient Descent (SGD)
Strategy
In contrast to batch gradient descent, stochastic gradient descent uses a “minibatch” of samples to compute the gradients
Common choices of minibatch size are 32 / 64 / 128
We use the name "Batch Gradient Descent" for the gradient descent process that uses the entire training dataset to compute the gradient
Batch gradient descent is too expensive when the training set is large, so in practice we mostly use SGD
Mathematical Expression
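A standard way to write the SGD update is the following (the notation here, with loss L, step size α, and minibatch B_t at step t, is an assumption, since the original notes left this section empty):

```latex
w_{t+1} = w_t - \alpha \, \nabla_w L(w_t; B_t),
\qquad
\nabla_w L(w_t; B_t) = \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_w \ell(w_t; x_i, y_i)
```

That is, the minibatch gradient is the average of the per-sample gradients over the sampled minibatch, used in place of the full-dataset gradient.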
Implementation
# SGD
w = initialize_weights()
for t in range(num_steps):
    minibatch = sample_data(data, batch_size)
    dw = compute_gradient(loss_fn, minibatch, w)
    w -= learning_rate * dw
Hyperparameters:
- Weight Initialization
- Number of Steps
- Learning Rate
- Batch Size
- Data Sampling
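A minimal runnable sketch of the loop above, using NumPy with a simple least-squares loss (the synthetic data, the loss, and the helper `compute_gradient` are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + small noise
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true + 0.01 * rng.normal(size=1000)

def compute_gradient(w, Xb, yb):
    # Gradient of the mean-squared-error loss on the minibatch
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

# Hyperparameters (arbitrary but reasonable choices)
num_steps, batch_size, learning_rate = 500, 32, 0.1

w = np.zeros(2)  # weight initialization
for t in range(num_steps):
    # Data sampling: draw a random minibatch each step
    idx = rng.integers(0, len(y), size=batch_size)
    dw = compute_gradient(w, X[idx], y[idx])
    w -= learning_rate * dw

print(w)  # should end up close to w_true
```

Each hyperparameter from the list appears explicitly in this sketch: the initialization (`np.zeros`), `num_steps`, `learning_rate`, `batch_size`, and the sampling strategy (uniform random with replacement here; shuffled epochs are also common).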
Problem
- Progress is slow along shallow dimensions, while the updates oscillate back and forth along steep directions
- Gets stuck at local minima and saddle points, where the gradient is zero (saddle points are extremely common in high dimensions)
- Since the gradient comes from a minibatch, it can be noisy
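The last point can be seen concretely by comparing minibatch gradients against the full-batch gradient; this small NumPy experiment (an illustrative assumption, not part of the notes) shows the estimation error shrinking as the batch size grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Any differentiable loss shows the effect; least squares is used for simplicity
X = rng.normal(size=(10000, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(size=10000)

def grad(w, Xb, yb):
    # Mean-squared-error gradient on a (mini)batch
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

w = np.zeros(2)
full = grad(w, X, y)  # full-batch ("true") gradient

# Average distance of minibatch gradients from the full-batch gradient
errs = {}
for bs in (8, 64, 512):
    ests = []
    for _ in range(200):
        idx = rng.integers(0, len(y), size=bs)
        ests.append(grad(w, X[idx], y[idx]))
    errs[bs] = np.linalg.norm(np.array(ests) - full, axis=1).mean()
    print(bs, errs[bs])  # noise decreases as batch size increases
```

The noise falls roughly as 1/√(batch size), which is one reason moderately larger minibatches (32/64/128) are a common default.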