Stochastic Gradient Descent (SGD)

Strategy

In contrast to batch gradient descent, stochastic gradient descent uses a "minibatch" of samples to compute the gradient

Common minibatch sizes are 32, 64, and 128

"Batch gradient descent" refers to gradient descent that uses the entire training dataset to compute each gradient

Batch gradient descent is too expensive when the training set is large, so in practice we mostly use SGD
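
The cost difference can be illustrated with a small sketch. The toy dataset (y = 3x plus noise), the squared-error loss, and the helper name `gradient` are assumptions made for illustration and are not part of the original notes. The minibatch gradient touches only 64 samples, yet it points in roughly the same direction as the full-batch gradient computed over all 10,000.

```python
import random

def gradient(w, samples):
    """Mean gradient of the squared error 0.5 * (w * x - y)**2 over `samples`."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Toy dataset (assumed for illustration): y = 3x plus small Gaussian noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]

w = 0.0
full_grad = gradient(w, data)                  # batch GD: uses all 10,000 samples
mini_grad = gradient(w, rng.sample(data, 64))  # SGD: uses a minibatch of 64

print("full-batch gradient:", full_grad)
print("minibatch gradient: ", mini_grad)
```

The two values differ slightly from run to run of the sampling, which is the noise in the minibatch estimate.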

Mathematical Expression

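x_{t+1} = x_t - \alpha \nabla f(x_t)

where x_t is the parameter vector at step t (w in the code), \alpha is the learning rate, and \nabla f(x_t) is the gradient, which SGD estimates on the current minibatch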
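
To make the update rule concrete, here is a single step worked out in code. The function f(x) = x**2, the starting point, and the learning rate are assumptions chosen for illustration, and the gradient here is exact rather than a minibatch estimate.

```python
def grad_f(x):
    """Gradient of the illustrative function f(x) = x**2."""
    return 2.0 * x

alpha = 0.1  # learning rate (assumed value)
x_t = 1.0    # current parameter (assumed starting point)

# One application of x_{t+1} = x_t - alpha * grad f(x_t):
# 1.0 - 0.1 * 2.0
x_next = x_t - alpha * grad_f(x_t)
print(x_next)
```

The step moves x toward the minimum at 0, and a larger `alpha` would move it further in one step.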

Implementation

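# SGD (pseudocode: the helper functions are placeholders)
w = initialize_weights()
for t in range(num_steps):
    # Draw a random minibatch from the training data
    minibatch = sample_data(data, batch_size)
    # Estimate the gradient of the loss on that minibatch only
    dw = compute_gradient(loss_fn, minibatch, w)
    # Step against the gradient, scaled by the learning rate
    w -= learning_rate * dw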
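
The loop above can be turned into a runnable sketch by filling in the placeholder helpers. Everything in the helper bodies is an assumption made for illustration: a single scalar weight, a squared-error loss, a toy dataset generated from y = 3x plus noise, and a finite-difference gradient standing in for backpropagation.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility

def initialize_weights():
    return 0.0  # one scalar weight, started at zero (assumed)

def sample_data(data, batch_size):
    return rng.sample(data, batch_size)  # uniform, without replacement

def loss_fn(w, x, y):
    return 0.5 * (w * x - y) ** 2  # squared error for the model y = w * x

def compute_gradient(loss_fn, minibatch, w, eps=1e-6):
    # Central finite difference of the mean minibatch loss with respect to w.
    def mean_loss(w_):
        return sum(loss_fn(w_, x, y) for x, y in minibatch) / len(minibatch)
    return (mean_loss(w + eps) - mean_loss(w - eps)) / (2.0 * eps)

# Toy dataset (assumed): y = 3x plus small Gaussian noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]

# Hyperparameters (assumed values)
num_steps, batch_size, learning_rate = 500, 32, 0.1

w = initialize_weights()
for t in range(num_steps):
    minibatch = sample_data(data, batch_size)
    dw = compute_gradient(loss_fn, minibatch, w)
    w -= learning_rate * dw

print("learned w:", w)  # should land near the true slope of 3
```

The finite-difference gradient keeps the sketch dependency-free. A real training loop would normally obtain `dw` from automatic differentiation instead.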

Hyperparameters:

Weight Initialization
Number of Steps
Learning Rate
Batch Size
Data Sampling

Problem

Progress is slow along shallow dimensions, while the iterates oscillate back and forth along steep ones
The update stalls at local minima and saddle points because the gradient there is zero (these are very easy to encounter when there are many dimensions)
Because the gradient comes from a minibatch, it can be noisy
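
The first two problems can be reproduced on tiny hand-made functions. The sketch below uses assumptions chosen for illustration: an ill-conditioned bowl f(x, y) = 0.5 * (x**2 + 50 * y**2), a saddle f(x, y) = x**2 - y**2, a learning rate of 0.035, and exact gradients so that the geometry is isolated from minibatch noise.

```python
def step(point, grad, lr):
    """One gradient descent update applied to a 2-D point."""
    return tuple(p - lr * g for p, g in zip(point, grad))

def bowl_grad(p):
    # Gradient of 0.5 * (x**2 + 50 * y**2): shallow in x, steep in y.
    x, y = p
    return (x, 50.0 * y)

def saddle_grad(p):
    # Gradient of x**2 - y**2, which has a saddle point at the origin.
    x, y = p
    return (2.0 * x, -2.0 * y)

lr = 0.035  # assumed learning rate

# Problem 1: slow progress in x, sign-flipping oscillation in y.
p = (1.0, 1.0)
trajectory = [p]
for _ in range(10):
    p = step(p, bowl_grad(p), lr)
    trajectory.append(p)

for x, y in trajectory[:5]:
    print(f"x = {x:.3f}, y = {y:+.3f}")

# Problem 2: at the saddle the gradient is zero, so the update does not move.
stuck = step((0.0, 0.0), saddle_grad((0.0, 0.0)), lr)
print("after one step at the saddle:", stuck)
```

In the bowl, y changes sign on every step while x shrinks by only a few percent per step. Lowering the learning rate would stop the oscillation but make progress in x even slower.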
