Value Iteration Convergence
The Core Insight: Growing Optimality Horizon
Each Q_i is optimal for exactly i steps, then reverts to the initial random policy.
What Each Q_i Represents
- Q_0: Random actions at every step
- Q_1: Optimal for 1 step, then random
- Q_2: Optimal for 2 steps, then random
- Q_i: Optimal for i steps, then random
The Update Mechanism
The nesting creates “telescoping optimality”:
- Q_{i+1} = optimal for 1 step + γ × (optimal for i steps) = optimal for i+1 steps
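The update above is the Bellman optimality backup. A minimal sketch on a toy MDP (the 2-state, 2-action transition probabilities and rewards here are made up purely for illustration):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, invented for illustration.
# P[s, a, s'] = transition probability, R[s, a] = immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def bellman_backup(Q):
    """One value-iteration step:
    Q_{i+1}(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q_i(s', a')
    i.e. optimal for 1 step, then continue with the horizon already in Q_i."""
    return R + gamma * P @ Q.max(axis=1)

Q = np.zeros((2, 2))       # Q_0: no lookahead yet
for i in range(3):
    Q = bellman_backup(Q)  # Q_1, Q_2, Q_3: horizon grows by one each backup
print(Q)
```

Each call extends the optimality horizon by exactly one step, which is the "telescoping" in action.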
Why Convergence Happens
Since γ < 1, rewards far in the future matter exponentially less. Eventually Q_i becomes optimal for so many steps that the remaining “random tail” contributes negligibly: after i optimal steps, the tail’s contribution is bounded by γ^i × R_max / (1 − γ), which shrinks geometrically.
Result: As i → ∞, the optimality horizon grows without bound while the random part vanishes, so Q_i → Q*.
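The geometric shrinkage can be checked numerically: the Bellman backup is a γ-contraction in the sup norm, so the error to Q* drops by at least a factor of γ per iteration. A self-contained sketch, using a made-up 2-state MDP and approximating Q* by running the backup to convergence:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, invented for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def backup(Q):
    # Bellman optimality backup: R + gamma * E[max_a' Q(s', a')]
    return R + gamma * P @ Q.max(axis=1)

# Approximate Q* by iterating far past convergence (error ~ gamma^2000).
Q_star = np.zeros((2, 2))
for _ in range(2000):
    Q_star = backup(Q_star)

# Track the sup-norm error of Q_i; it contracts by at least gamma per step.
Q = np.zeros((2, 2))
errors = []
for _ in range(10):
    Q = backup(Q)
    errors.append(np.abs(Q - Q_star).max())
print(errors)
```

The printed errors decay geometrically, which is exactly the γ^i bound on the vanishing random tail.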