SGD + Momentum
Strategy
Add momentum to plain SGD so that each update “remembers” the steps the algorithm has already taken.
- Use a “velocity” term to accumulate past gradients
- Use a “friction” term (rho) to reduce the influence of old gradients on the current step
Mathematical Expression
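The update rule can be written as below. This is a reconstruction that follows the conventions of the implementation in the next section, so the symbols are assumptions: $\alpha$ is the learning rate, $\rho$ is the friction coefficient, $\nabla f(w_t)$ is the gradient at the current weights, and the velocity starts at $v_0 = 0$.

```latex
\begin{aligned}
v_{t+1} &= \rho \, v_t + \nabla f(w_t) \\
w_{t+1} &= w_t - \alpha \, v_{t+1}
\end{aligned}
```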
Implementation
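# SGD + Momentum
v = 0  # velocity: decayed sum of past gradients, starts at rest
for t in range(num_steps):
    dw = compute_gradient(w)  # gradient at the current weights
    v = rho * v + dw          # friction decays the old velocity, then add the new gradient
    w -= learning_rate * v    # step along the velocity instead of the raw gradient

There are different ways to implement SGD+Momentum, but they all give the same sequence of weights w.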
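As a hedged illustration of that equivalence, the sketch below runs the form used in these notes next to one common alternative that folds the learning rate into the velocity (`v = rho * v - learning_rate * dw; w += v`). The toy quadratic objective, the function names, and the hyperparameter values are assumptions chosen for the demo, not part of the original notes.

```python
def compute_gradient(w):
    # Gradient of the toy objective f(w) = 0.5 * w**2 (an assumption for this demo).
    return w


def momentum_form_a(w, rho, learning_rate, num_steps):
    # Form used in these notes: the velocity accumulates raw gradients.
    v = 0.0
    trajectory = []
    for _ in range(num_steps):
        dw = compute_gradient(w)
        v = rho * v + dw
        w = w - learning_rate * v
        trajectory.append(w)
    return trajectory


def momentum_form_b(w, rho, learning_rate, num_steps):
    # Alternative form: the learning rate is folded into the velocity.
    v = 0.0
    trajectory = []
    for _ in range(num_steps):
        dw = compute_gradient(w)
        v = rho * v - learning_rate * dw
        w = w + v
        trajectory.append(w)
    return trajectory


traj_a = momentum_form_a(w=1.0, rho=0.9, learning_rate=0.1, num_steps=50)
traj_b = momentum_form_b(w=1.0, rho=0.9, learning_rate=0.1, num_steps=50)

# Largest gap between the two weight sequences; only rounding error is expected.
max_gap = max(abs(a - b) for a, b in zip(traj_a, traj_b))
print(max_gap)
```

The two forms coincide only while the learning rate stays constant. With a learning-rate schedule, the first form rescales all past gradients by the current rate while the second keeps the rate each gradient was added with, so the trajectories can differ.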
Resolved Problem
- In shallow dimensions, where gradients are small, the accumulated “velocity” keeps the step size reasonable
- In steep dimensions, overshooting produces a gradient that opposes the current velocity, which makes the next step smaller and damps the oscillation
- At a saddle point or local minimum, where the gradient vanishes, the remaining “velocity” can carry us out of it
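The first of these points can be checked on a small example. The sketch below is a minimal comparison under assumed settings: a toy quadratic with one shallow and one steep dimension, a hypothetical `run` helper, and hand-picked values for the learning rate and `rho`. It only illustrates progress along the shallow dimension, not the saddle-point or oscillation behaviour.

```python
def gradient(w):
    # Gradient of f(x, y) = 0.5 * (0.1 * x**2 + 10 * y**2).
    # x is the shallow dimension, y the steep one (toy objective, assumed for the demo).
    x, y = w
    return (0.1 * x, 10.0 * y)


def run(rho, learning_rate=0.15, num_steps=100):
    # rho = 0 recovers plain SGD; rho > 0 adds momentum.
    w = (1.0, 1.0)
    v = (0.0, 0.0)
    for _ in range(num_steps):
        dw = gradient(w)
        v = tuple(rho * vi + gi for vi, gi in zip(v, dw))
        w = tuple(wi - learning_rate * vi for wi, vi in zip(w, v))
    return w


x_sgd, _ = run(rho=0.0)
x_momentum, _ = run(rho=0.9)

# Remaining distance to the minimum along the shallow dimension.
print(abs(x_sgd), abs(x_momentum))
```

With these settings the momentum run should end much closer to the minimum along `x`, because the small but consistent gradients in that dimension keep adding to the velocity.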
New Problem
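SGD+Momentum determines the next step (the new velocity) from two things:
- The past steps (the previous velocity)
- The place you currently are (the gradient at the current weights)

However, intuitively, the velocity is about to carry us away from the current position, so we should use the gradient at the point we are about to reach instead of the gradient at the current position when computing the new velocity. Nesterov Momentum, which we’ll introduce later, solves this problem.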