SGD + Momentum

Strategy

Add momentum to the original SGD so that the algorithm “remembers” the steps it has already taken:

  1. Use a “velocity” term to accumulate past gradients.
  2. Use a “friction” term to reduce the influence of old gradients on the current step.

Mathematical Expression
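
Written to match the implementation below, the update can be expressed as follows. The notation is an assumption made here: $v_t$ is the velocity with $v_0 = 0$, $\rho$ is the friction coefficient (`rho`), $\alpha$ is the learning rate, and $\nabla f(w_t)$ is the gradient at the current weights.

```latex
\begin{aligned}
v_{t+1} &= \rho\, v_t + \nabla f(w_t) \\
w_{t+1} &= w_t - \alpha\, v_{t+1}
\end{aligned}
```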

Implementation

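# SGD + Momentum
v = 0  # velocity: decayed running sum of past gradients
for t in range(num_steps):
    dw = compute_gradient(w)   # gradient at the current weights
    v = rho * v + dw           # rho acts as friction on the old velocity
    w -= learning_rate * v     # step along the velocity, not the raw gradient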

There are different ways to implement SGD+Momentum, but they all give the same sequence of weights w.
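
As a rough check of this equivalence, the sketch below runs the update above next to one other common formulation, in which the learning rate is folded into the velocity (v = rho * v - learning_rate * dw, then w += v). The quadratic loss, the helper names, and all hyperparameter values are illustrative assumptions, and the match relies on the learning rate staying constant.

```python
import numpy as np

def compute_gradient(w):
    # Hypothetical loss f(w) = 0.5 * sum(scales * w**2), chosen only for illustration.
    scales = np.array([1.0, 10.0])
    return scales * w

def momentum_v1(w0, learning_rate, rho, num_steps):
    # Formulation from the notes: the velocity accumulates raw gradients.
    w, v = w0.copy(), np.zeros_like(w0)
    history = []
    for _ in range(num_steps):
        dw = compute_gradient(w)
        v = rho * v + dw
        w = w - learning_rate * v
        history.append(w.copy())
    return np.array(history)

def momentum_v2(w0, learning_rate, rho, num_steps):
    # Alternative formulation: the learning rate is folded into the velocity.
    w, v = w0.copy(), np.zeros_like(w0)
    history = []
    for _ in range(num_steps):
        dw = compute_gradient(w)
        v = rho * v - learning_rate * dw
        w = w + v
        history.append(w.copy())
    return np.array(history)

w_start = np.array([1.0, 1.0])
path_v1 = momentum_v1(w_start, learning_rate=0.05, rho=0.9, num_steps=50)
path_v2 = momentum_v2(w_start, learning_rate=0.05, rho=0.9, num_steps=50)
print(np.allclose(path_v1, path_v2))
```

The two velocities differ only by a factor of -learning_rate, which is why the weight sequences coincide up to floating-point rounding.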

Resolved Problem

  1. In shallow dimensions, where the gradient is small, the accumulated “velocity” keeps the step at a reasonable size.
  2. In steep dimensions, overshooting produces a gradient that opposes the current velocity, which makes the next step smaller and damps the oscillation.
  3. At a saddle point or local minimum, where the gradient vanishes, the remaining “velocity” can carry us past it.
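
The first point can be illustrated with a small experiment. The sketch below is a toy setup under assumed settings, not a benchmark: a hypothetical quadratic loss that is shallow along w[0] and steep along w[1], full-batch gradients instead of stochastic ones, and arbitrary values for the learning rate, rho, and step count. Setting rho = 0 reduces the update to plain gradient descent.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic:
# curvature 1 along w[0] (shallow) and 100 along w[1] (steep).
CURVATURE = np.array([1.0, 100.0])

def loss(w):
    return 0.5 * np.sum(CURVATURE * w ** 2)

def compute_gradient(w):
    return CURVATURE * w

def run(rho, learning_rate=0.01, num_steps=100):
    # Same update as in the notes; rho = 0 gives plain gradient descent.
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(num_steps):
        dw = compute_gradient(w)
        v = rho * v + dw
        w = w - learning_rate * v
    return w

w_plain = run(rho=0.0)
w_momentum = run(rho=0.9)
print("plain:   ", w_plain, loss(w_plain))
print("momentum:", w_momentum, loss(w_momentum))
```

With these settings, the shallow coordinate shrinks by only a factor of 0.99 per step without momentum, while the accumulated velocity lets the momentum run cover that direction much faster at the same learning rate.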

New Problem

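SGD+Momentum determines the direction to go (the new velocity) from

  1. The past steps
  2. The place you currently are

However, intuitively we should use the gradient at the point the velocity is about to carry us to, rather than the gradient at the current point, when computing the update. Nesterov Momentum, which we'll introduce later, solves this problem.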