Deep Q-Learning

Approach

We want to train a neural network with weights $\theta$ that approximates the optimal Q-function: $Q(s, a; \theta) \approx Q^*(s, a)$
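As a minimal sketch of what "a network with weights $\theta$ that maps a state to one Q-value per action" could look like, here is a tiny two-layer MLP in plain numpy. All sizes, names (`init_q_network`, `q_values`), and the architecture itself are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def init_q_network(state_dim, n_actions, hidden=32, seed=0):
    # Hypothetical two-layer MLP; layer sizes are illustrative choices.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (state_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def q_values(theta, state):
    # Forward pass: returns Q(s, a; theta) for every action a at once.
    h = np.maximum(0.0, state @ theta["W1"] + theta["b1"])  # ReLU hidden layer
    return h @ theta["W2"] + theta["b2"]

theta = init_q_network(state_dim=4, n_actions=2)
q = q_values(theta, np.zeros(4))  # one Q-value per action
```

A greedy policy would then simply pick `np.argmax(q)`.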

Evaluating the Network

The target can be calculated by: $y = r + \gamma \max_{a'} Q(s', a'; \theta)$

Then we can use it to define the loss: $L(\theta) = \big(y - Q(s, a; \theta)\big)^2$
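The target and loss above can be sketched in a few lines. The function names and the toy numbers are assumptions for illustration; the terminal-state branch follows the convention that $y = r$ when $s'$ ends the episode:

```python
import numpy as np

def td_target(reward, q_next, gamma=0.99, done=False):
    # y = r                              if s' is terminal
    # y = r + gamma * max_a' Q(s', a')   otherwise
    return reward if done else reward + gamma * np.max(q_next)

def td_loss(y, q_sa):
    # Squared TD error: (y - Q(s, a; theta))^2
    return (y - q_sa) ** 2

# Toy example: r = 1.0, next-state Q-values [0.5, 2.0], gamma = 0.9
y = td_target(reward=1.0, q_next=np.array([0.5, 2.0]), gamma=0.9)
# y = 1.0 + 0.9 * 2.0 = 2.8
```

Note that `y` is treated as a fixed number when differentiating the loss; only the $Q(s, a; \theta)$ term carries gradients, which is exactly what causes the moving-target problem discussed below.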

Why do we define the loss this way?

The Bellman optimality equation says that when $Q$ reaches the optimum $Q^*$, then

$$Q^*(s, a) = \mathbb{E}\big[r + \gamma \max_{a'} Q^*(s', a')\big]$$

Hence, when $Q(s, a; \theta)$ reaches $Q^*(s, a)$, the target $y$ equals $Q(s, a; \theta)$ and the loss is zero. Minimizing this loss therefore pushes $Q$ toward satisfying the Bellman equation.


Problem

Target Non-Stationary

The target $y$ is the value we want our network to predict. However, the target itself depends on the weights $\theta$: when we update the weights in every iteration, the target also changes.

This creates a situation where we are chasing a target that is forever moving.

Solution: Fixed Q-Targets

We define a separate target network with weights $\theta^-$ for evaluating the target:

  1. make $\theta^- = \theta$ (copy the online weights)
  2. keep $\theta^-$ fixed for a few training steps
  3. sync $\theta^- \leftarrow \theta$ again

This way, for the steps during which $\theta^-$ is frozen, the target stays fixed.
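The three steps above can be sketched as a skeleton training loop. The sync period (`SYNC_EVERY = 100`) and the scalar stand-in for the network weights are assumptions for illustration only:

```python
import copy

SYNC_EVERY = 100  # illustrative choice for "a few training steps"

theta = {"w": 0.0}                  # stand-in for the online network weights
theta_minus = copy.deepcopy(theta)  # step 1: target network starts as a copy

synced_at = []
for step in range(1, 301):
    theta["w"] += 0.01              # stand-in for one gradient update
    # Step 2: theta_minus stays frozen between syncs, so the target
    # y = r + gamma * max_a' Q(s', a'; theta_minus) does not move
    # while theta is being trained.
    if step % SYNC_EVERY == 0:      # step 3: sync theta_minus <- theta
        theta_minus = copy.deepcopy(theta)
        synced_at.append(step)
```

In a real implementation, `theta` and `theta_minus` would be full parameter sets of two networks with the same architecture, and only `theta` receives gradient updates.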

How to sample batches of data for training?

TBD: I have no clue what this problem is about yet.