Nesterov Momentum

Strategy

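Instead of deciding where to go based only on the current position (compute_gradient(w)), this approach chooses to “look ahead” first: it evaluates the gradient at the point the current velocity would carry the weights to, then uses that gradient to decide what correction to impose on the current momentum.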

Mathematical Expression

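With weights $w_t$, velocity $v_t$, momentum coefficient $\rho$ (rho) and learning rate $\alpha$ (learning_rate), the look-ahead update is

$$v_{t+1} = \rho v_t - \alpha \nabla f(w_t + \rho v_t)$$
$$w_{t+1} = w_t + v_{t+1}$$

However, we want to update in terms of $\tilde{w}_t$ and $\nabla f(\tilde{w}_t)$, so that the gradient is evaluated at the variable we actually store. Thus by the change of variable ($\tilde{w}_t = w_t + \rho v_t$) and some rearrangement, we have

$$v_{t+1} = \rho v_t - \alpha \nabla f(\tilde{w}_t)$$
$$\tilde{w}_{t+1} = \tilde{w}_t - \rho v_t + (1 + \rho) v_{t+1} = \tilde{w}_t + v_{t+1} + \rho (v_{t+1} - v_t)$$

The rearrangement follows from $\tilde{w}_{t+1} = w_{t+1} + \rho v_{t+1} = w_t + v_{t+1} + \rho v_{t+1}$, with $w_t$ replaced by $\tilde{w}_t - \rho v_t$.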
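As a quick sanity check on the change of variable, the sketch below runs both forms side by side and compares the shifted iterates. It is only an illustration under assumptions not in the notes: the toy objective f(w) = 0.5 * w**2, the helper names grad, lookahead_form and shifted_form, and the hyperparameter values are all hypothetical.

```python
def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * w**2 (assumed for illustration).
    return w


def lookahead_form(w, rho, lr, num_steps):
    """Original form: the gradient is taken at the look-ahead point w + rho * v."""
    v = 0.0
    shifted = []
    for _ in range(num_steps):
        v = rho * v - lr * grad(w + rho * v)
        w = w + v
        shifted.append(w + rho * v)  # record the shifted variable w + rho * v
    return shifted


def shifted_form(w_tilde, rho, lr, num_steps):
    """Rearranged form: the gradient is taken at the stored variable itself."""
    v = 0.0
    history = []
    for _ in range(num_steps):
        old_v = v
        v = rho * v - lr * grad(w_tilde)
        w_tilde = w_tilde - rho * old_v + (1 + rho) * v
        history.append(w_tilde)
    return history


# The velocity starts at zero, so both forms start from the same point.
a = lookahead_form(5.0, rho=0.9, lr=0.1, num_steps=50)
b = shifted_form(5.0, rho=0.9, lr=0.1, num_steps=50)
max_gap = max(abs(x - y) for x, y in zip(a, b))
print(max_gap)  # expected to differ only by floating-point rounding
```

The two trajectories should agree up to rounding, which is why the rearranged form can be used with a gradient routine that only accepts the stored weights.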

Implementation

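v = 0  # velocity starts at rest
for t in range(num_steps):
    dw = compute_gradient(w)  # w holds the shifted (look-ahead) variable
    old_v = v
    v = rho * v - learning_rate * dw  # velocity update
    w += -rho * old_v + (1 + rho) * v  # same as w += v + rho * (v - old_v)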
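
The loop above leaves w, rho, learning_rate, num_steps and compute_gradient undefined. A minimal runnable sketch follows, assuming a hypothetical one-dimensional objective f(w) = 0.5 * (w - 3)**2 and illustrative hyperparameter values; the wrapper function nesterov is also an assumption added for the example.

```python
def compute_gradient(w):
    # Gradient of the assumed objective f(w) = 0.5 * (w - 3)**2, minimized at w = 3.
    return w - 3.0


def nesterov(w, rho=0.9, learning_rate=0.1, num_steps=200):
    v = 0.0  # velocity starts at rest
    for _ in range(num_steps):
        dw = compute_gradient(w)           # gradient at the stored (shifted) variable
        old_v = v
        v = rho * v - learning_rate * dw   # velocity update
        w += -rho * old_v + (1 + rho) * v  # position update in the shifted variable
    return w


w_final = nesterov(0.0)
print(w_final)  # should be very close to the minimizer at 3.0
```

Since the velocity decays to zero near the minimum, the shifted variable and the original weights coincide there, so returning w directly is reasonable in this sketch.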