Value Iteration Convergence

The Core Insight: Growing Optimality Horizon

Each iterate Q_i is optimal for exactly i steps, then reverts to the initial random policy

What Each Q_i Represents

  • Q_0: Random actions at every step
  • Q_1: Optimal for 1 step, then random
  • Q_2: Optimal for 2 steps, then random
  • Q_i: Optimal for i steps, then random

The Update Mechanism

The Bellman update Q_{i+1}(s, a) = r(s, a) + γ · E[max_{a'} Q_i(s', a')] nests the previous iterate inside one step of optimal lookahead, creating “telescoping optimality”:

  • Q_{i+1} = optimal for 1 step + γ × (optimal for i steps) = optimal for i+1 steps
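The update above can be sketched in a few lines of NumPy. The MDP here (2 states, 2 actions, the reward matrix R and transition tensor P) is entirely hypothetical, chosen only to make the backup concrete; for simplicity Q_0 is initialized to zero rather than to the random policy's values, which does not change the mechanism.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP for illustration only.
# R[s, a] = immediate reward; P[s, a, s2] = transition probability.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9

def bellman_backup(Q):
    """One value-iteration step: Q'(s,a) = R(s,a) + gamma * E[max_a' Q(s',a')]."""
    V = Q.max(axis=1)            # best achievable value from each next state
    return R + gamma * (P @ V)   # expectation over next states s'

Q0 = np.zeros_like(R)            # simplistic initialization (no random tail)
Q1 = bellman_backup(Q0)          # optimal for 1 step: with Q0 = 0, Q1 equals R
Q2 = bellman_backup(Q1)          # optimal for 2 steps
print(Q1)
print(Q2)
```

Each call wraps one more step of optimal lookahead around the previous iterate, which is exactly the telescoping in the bullet above.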

Why Convergence Happens

Since γ < 1, rewards far in the future matter exponentially less. After i optimal steps, the remaining “random tail” can contribute at most γ^i · R_max / (1 − γ), so eventually Q_i is optimal for so many steps that the tail is negligible.

Result: As i → ∞, the optimality horizon grows without bound while the random tail vanishes, so Q_i → Q*.
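This geometric shrinkage can be checked numerically. The sketch below (same kind of hypothetical 2-state MDP as before, names and numbers invented for illustration) approximates Q* by running the backup many times, then verifies that the sup-norm error of the iterates shrinks by at least a factor of γ per step, as the contraction property guarantees:

```python
import numpy as np

# Hypothetical MDP, for illustration only.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9

def backup(Q):
    return R + gamma * (P @ Q.max(axis=1))

# Approximate Q* by iterating far past convergence.
Q_star = np.zeros_like(R)
for _ in range(2000):
    Q_star = backup(Q_star)

# Track the sup-norm error of the first few iterates.
Q = np.zeros_like(R)
errors = []
for _ in range(10):
    errors.append(np.abs(Q - Q_star).max())
    Q = backup(Q)

# Each step shrinks the error by at least a factor of gamma = 0.9.
for e_prev, e_next in zip(errors, errors[1:]):
    assert e_next <= gamma * e_prev + 1e-9
print(errors)
```

The printed errors decay geometrically, which is the quantitative content of “the random tail vanishes.”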