Value Iteration Convergence

The Core Insight: Growing Optimality Horizon

Each iterate Q_i is optimal for exactly i steps, then reverts to the initial random policy

What Each Q_i Represents

  • Q_0: Random actions at every step
  • Q_1: Optimal for 1 step, then random
  • Q_2: Optimal for 2 steps, then random
  • Q_i: Optimal for i steps, then random

The Update Mechanism

The Bellman update Q_{i+1}(s, a) = r(s, a) + γ · E[max_{a'} Q_i(s', a')] nests the previous iterate inside one step of optimal lookahead, creating “telescoping optimality”:

  • Q_{i+1} = optimal for 1 step + γ × (optimal for i steps) = optimal for i+1 steps
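The update above can be sketched in a few lines of NumPy. The MDP here (2 states, 2 actions, the reward matrix R and transition tensor P) is entirely hypothetical, chosen only to make the backup concrete; for simplicity Q_0 is initialized to zero rather than to the random policy's values, which does not change the mechanism.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP for illustration only.
# R[s, a] = immediate reward; P[s, a, s2] = transition probability.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9

def bellman_backup(Q):
    """One value-iteration step: Q'(s,a) = R(s,a) + gamma * E[max_a' Q(s',a')]."""
    V = Q.max(axis=1)            # best achievable value from each next state
    return R + gamma * (P @ V)   # expectation over next states s'

Q0 = np.zeros_like(R)            # simplistic initialization (no random tail)
Q1 = bellman_backup(Q0)          # optimal for 1 step: with Q0 = 0, Q1 equals R
Q2 = bellman_backup(Q1)          # optimal for 2 steps
print(Q1)
print(Q2)
```

Each call wraps one more step of optimal lookahead around the previous iterate, which is exactly the telescoping in the bullet above.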

Why Convergence Happens

Since γ < 1, rewards far in the future matter exponentially less. After i optimal steps, the remaining “random tail” can contribute at most γ^i · R_max / (1 − γ), so eventually Q_i is optimal for so many steps that the tail is negligible.

Result: As i → ∞, the optimality horizon grows without bound while the random tail vanishes, so Q_i → Q*.
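This geometric shrinkage can be checked numerically. The sketch below (same kind of hypothetical 2-state MDP as before, names and numbers invented for illustration) approximates Q* by running the backup many times, then verifies that the sup-norm error of the iterates shrinks by at least a factor of γ per step, as the contraction property guarantees:

```python
import numpy as np

# Hypothetical MDP, for illustration only.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9

def backup(Q):
    return R + gamma * (P @ Q.max(axis=1))

# Approximate Q* by iterating far past convergence.
Q_star = np.zeros_like(R)
for _ in range(2000):
    Q_star = backup(Q_star)

# Track the sup-norm error of the first few iterates.
Q = np.zeros_like(R)
errors = []
for _ in range(10):
    errors.append(np.abs(Q - Q_star).max())
    Q = backup(Q)

# Each step shrinks the error by at least a factor of gamma = 0.9.
for e_prev, e_next in zip(errors, errors[1:]):
    assert e_next <= gamma * e_prev + 1e-9
print(errors)
```

The printed errors decay geometrically, which is the quantitative content of “the random tail vanishes.”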