Deep Q-Learning

Approach

We want to train a neural network with weights $\theta$ that approximates the optimal Q-function: $Q(s, a; \theta) \approx Q^*(s, a)$
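As a minimal sketch of what "a network with weights $\theta$ that maps a state to one Q-value per action" could look like, here is a tiny two-layer MLP in plain numpy. All sizes, names (`init_q_network`, `q_values`), and the architecture itself are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def init_q_network(state_dim, n_actions, hidden=32, seed=0):
    # Hypothetical two-layer MLP; layer sizes are illustrative choices.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (state_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def q_values(theta, state):
    # Forward pass: returns Q(s, a; theta) for every action a at once.
    h = np.maximum(0.0, state @ theta["W1"] + theta["b1"])  # ReLU hidden layer
    return h @ theta["W2"] + theta["b2"]

theta = init_q_network(state_dim=4, n_actions=2)
q = q_values(theta, np.zeros(4))  # one Q-value per action
```

A greedy policy would then simply pick `np.argmax(q)`.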

Evaluating the Network

The target can be calculated by: $y = r + \gamma \max_{a'} Q(s', a'; \theta)$

Then we can use it to define the loss: $L(\theta) = \big(y - Q(s, a; \theta)\big)^2$
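The target and loss above can be sketched in a few lines. The function names and the toy numbers are assumptions for illustration; the terminal-state branch follows the convention that $y = r$ when $s'$ ends the episode:

```python
import numpy as np

def td_target(reward, q_next, gamma=0.99, done=False):
    # y = r                              if s' is terminal
    # y = r + gamma * max_a' Q(s', a')   otherwise
    return reward if done else reward + gamma * np.max(q_next)

def td_loss(y, q_sa):
    # Squared TD error: (y - Q(s, a; theta))^2
    return (y - q_sa) ** 2

# Toy example: r = 1.0, next-state Q-values [0.5, 2.0], gamma = 0.9
y = td_target(reward=1.0, q_next=np.array([0.5, 2.0]), gamma=0.9)
# y = 1.0 + 0.9 * 2.0 = 2.8
```

Note that `y` is treated as a fixed number when differentiating the loss; only the $Q(s, a; \theta)$ term carries gradients, which is exactly what causes the moving-target problem discussed below.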

Why do we define the loss this way?

The Bellman optimality equation says that when $Q$ reaches the optimum $Q^*$, then

$$Q^*(s, a) = \mathbb{E}\big[r + \gamma \max_{a'} Q^*(s', a')\big]$$

Hence, when $Q(s, a; \theta)$ reaches $Q^*(s, a)$, the target $y$ equals $Q(s, a; \theta)$ and the loss is zero. Minimizing this loss therefore pushes $Q$ toward satisfying the Bellman equation.


Problem

Target Non-Stationary

The target $y$ is the value we want our network to predict. However, the target itself depends on the weights $\theta$: when we update the weights in every iteration, the target also changes.

This creates a situation where we are chasing a target that is forever moving.

Solution: Fixed Q-Targets

We define a separate target network with weights $\theta^-$ for evaluating the target:

  1. make $\theta^- = \theta$ (copy the online weights)
  2. keep $\theta^-$ fixed for a few training steps
  3. sync $\theta^- \leftarrow \theta$ again

This way, for the steps during which $\theta^-$ is frozen, the target stays fixed.
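The three steps above can be sketched as a skeleton training loop. The sync period (`SYNC_EVERY = 100`) and the scalar stand-in for the network weights are assumptions for illustration only:

```python
import copy

SYNC_EVERY = 100  # illustrative choice for "a few training steps"

theta = {"w": 0.0}                  # stand-in for the online network weights
theta_minus = copy.deepcopy(theta)  # step 1: target network starts as a copy

synced_at = []
for step in range(1, 301):
    theta["w"] += 0.01              # stand-in for one gradient update
    # Step 2: theta_minus stays frozen between syncs, so the target
    # y = r + gamma * max_a' Q(s', a'; theta_minus) does not move
    # while theta is being trained.
    if step % SYNC_EVERY == 0:      # step 3: sync theta_minus <- theta
        theta_minus = copy.deepcopy(theta)
        synced_at.append(step)
```

In a real implementation, `theta` and `theta_minus` would be full parameter sets of two networks with the same architecture, and only `theta` receives gradient updates.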

How to sample batches of data for training?

TBD: I have no clue what this problem is about yet.