Nesterov Momentum

Strategy

Instead of deciding where to go based only on the gradient at the current position (compute_gradient(w)), this approach first "looks ahead" to the point the current velocity would carry us to, then uses the gradient at that point to decide what correction to apply to the current velocity, i.e., it evaluates $\nabla f (x_{t} + \rho v_{t})$.

Mathematical Expression

$$
\begin{aligned}
v_{t + 1} &= \rho v_{t} - \alpha \nabla f (x_{t} + \rho v_{t}) \\
x_{t + 1} &= x_{t} + v_{t + 1}
\end{aligned}
$$
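
As a minimal sketch of the look-ahead form, the loop below applies the two update equations directly. The objective $f(x) = \tfrac{1}{2}\lVert x \rVert^2$, the helper names `grad_f` and `nesterov_lookahead`, and the hyperparameter values are assumptions chosen for illustration; they are not part of the original notes.

```python
import numpy as np

def grad_f(x):
    # Gradient of the illustrative objective f(x) = 0.5 * ||x||^2 (an assumption).
    return x

def nesterov_lookahead(x0, rho=0.9, alpha=0.1, num_steps=200):
    """Nesterov momentum in its look-ahead form (hypothetical helper)."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(num_steps):
        # Evaluate the gradient at the look-ahead point x_t + rho * v_t.
        g = grad_f(x + rho * v)
        # v_{t+1} = rho * v_t - alpha * grad f(x_t + rho * v_t)
        v = rho * v - alpha * g
        # x_{t+1} = x_t + v_{t+1}
        x = x + v
    return x

x_final = nesterov_lookahead([3.0, -2.0])
```

On this toy objective the iterate should approach the minimizer at the origin. Note that the gradient is evaluated at a point other than `x`, which is what the change of variable removes.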

However, we usually want an update in the same form as $x_{t}$ and $\nabla f (x_{t})$, where the gradient is evaluated at the point being stored. With the change of variable $\tilde{x}_{t} = x_{t} + \rho v_{t}$ and some rearrangement, we have

$$
\begin{aligned}
v_{t + 1} &= \rho v_{t} - \alpha \nabla f (\tilde{x}_{t}) \\
\tilde{x}_{t + 1} &= \tilde{x}_{t} - \rho v_{t} + (1 + \rho) v_{t + 1} \\
&= \tilde{x}_{t} + v_{t + 1} + \rho (v_{t + 1} - v_{t})
\end{aligned}
$$
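
The rearrangement can be spelled out step by step. This uses only the definition $\tilde{x}_{t} = x_{t} + \rho v_{t}$ (so $x_{t} = \tilde{x}_{t} - \rho v_{t}$) and the position update $x_{t + 1} = x_{t} + v_{t + 1}$:

$$
\begin{aligned}
\tilde{x}_{t + 1} &= x_{t + 1} + \rho v_{t + 1} \\
&= x_{t} + v_{t + 1} + \rho v_{t + 1} \\
&= (\tilde{x}_{t} - \rho v_{t}) + (1 + \rho) v_{t + 1} \\
&= \tilde{x}_{t} + v_{t + 1} + \rho (v_{t + 1} - v_{t})
\end{aligned}
$$

The last line reads as an ordinary momentum step $\tilde{x}_{t} + v_{t + 1}$ plus a correction proportional to the change in velocity.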

Implementation

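# w holds the shifted iterate x_t + rho * v_t; compute_gradient, rho,
# learning_rate, and num_steps are assumed to be supplied by the caller.
v = 0
for t in range(num_steps):
    dw = compute_gradient(w)            # gradient at the shifted iterate
    old_v = v                           # keep v_t for the position update
    v = rho * v - learning_rate * dw    # v_{t+1}
    w -= rho * old_v - (1 + rho) * v    # w <- w - rho * v_t + (1 + rho) * v_{t+1}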
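
To check that the loop above matches the look-ahead form, the sketch below runs both side by side and tracks the largest deviation from the relation w = x + rho * v. The quadratic objective, the matrix `A`, the helper `run_both_forms`, and the hyperparameter values are assumptions made for illustration only.

```python
import numpy as np

# Illustrative quadratic objective f(x) = 0.5 * x^T A x (an assumption).
A = np.array([[3.0, 0.5], [0.5, 1.0]])

def grad_f(x):
    return A @ x

def run_both_forms(x0, rho=0.9, alpha=0.05, num_steps=50):
    """Run the look-ahead form and the shifted form together (hypothetical helper)."""
    x = np.asarray(x0, dtype=float)   # look-ahead form: position x_t
    v = np.zeros_like(x)              # look-ahead form: velocity v_t
    w = x.copy()                      # shifted form: x_t + rho * v_t (equals x_0 since v_0 = 0)
    u = np.zeros_like(x)              # shifted form: velocity
    max_gap = 0.0
    for _ in range(num_steps):
        # Look-ahead form: gradient at x_t + rho * v_t.
        v = rho * v - alpha * grad_f(x + rho * v)
        x = x + v
        # Shifted form: gradient at the stored iterate w.
        old_u = u
        u = rho * u - alpha * grad_f(w)
        w = w - rho * old_u + (1 + rho) * u
        # Track how far w drifts from x + rho * v.
        max_gap = max(max_gap, float(np.max(np.abs(w - (x + rho * v)))))
    return x, w, max_gap

x_end, w_end, gap = run_both_forms([1.0, -1.0])
```

Under these assumptions `gap` should stay at floating-point rounding level, which suggests the two forms trace the same trajectory up to the change of variable.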

Chilfox
