Deep Q-Learning
Approach
We want to train a neural network with weights $\theta$ that approximates the optimal Q-function: $Q(s, a; \theta) \approx Q^*(s, a)$
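A minimal sketch of what such a network can look like (the architecture and sizes here are my own illustration, not from these notes): state vector in, one Q-value per action out.

```python
import numpy as np

# Hypothetical tiny Q-network: state vector in, one Q-value per action out.
rng = np.random.default_rng(0)
state_dim, hidden, n_actions = 4, 16, 2
W1 = rng.normal(size=(hidden, state_dim))
W2 = rng.normal(size=(n_actions, hidden))

def q_net(state):
    """Q(state, a; theta) for every action a, where theta = (W1, W2)."""
    h = np.maximum(0.0, W1 @ state)  # ReLU hidden layer
    return W2 @ h                    # one scalar per action

q = q_net(rng.normal(size=state_dim))
greedy_action = int(np.argmax(q))    # acting greedily w.r.t. Q
```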
Evaluating the Network
The target can be calculated by:
$$y = r + \gamma \max_{a'} Q(s', a'; \theta)$$
Then we can use it to define the loss:
$$L(\theta) = \big(y - Q(s, a; \theta)\big)^2$$
Why do we define the loss this way?
The Bellman equation says that the optimal Q-function satisfies $Q^*(s, a) = r + \gamma \max_{a'} Q^*(s', a')$ (in expectation over $s'$).
Hence, when $Q(s, a; \theta)$ reaches $Q^*$, the prediction matches the target $y$ and the loss goes to zero.
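Concretely, the target and loss above can be computed like this (using a toy linear Q-function of my own choosing, just to make the computation concrete):

```python
import numpy as np

# Toy linear Q-function Q(s, a; theta) = theta[a] @ s (illustrative only).
rng = np.random.default_rng(0)
n_actions, state_dim = 3, 4
theta = rng.normal(size=(n_actions, state_dim))  # one weight row per action

def q_values(state, weights):
    """Q(state, a; weights) for every action a."""
    return weights @ state

gamma = 0.99
s = rng.normal(size=state_dim)       # current state
a = 1                                # action taken
r = 1.0                              # reward observed
s_next = rng.normal(size=state_dim)  # next state

# Target: y = r + gamma * max_a' Q(s', a'; theta)
y = r + gamma * np.max(q_values(s_next, theta))

# Loss: L(theta) = (y - Q(s, a; theta))^2
loss = (y - q_values(s, theta)[a]) ** 2
```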
Problem
Target Non-Stationary
The target $y$ is the value we want the network to predict. However, $y$ depends on the weights $\theta$: when we update the weights in every iteration, our target is also changing
This creates a situation where we are chasing a target that is forever moving
Solution: Fixed Q-Targets
We define a target network with weights $\theta^-$ for evaluating the target
- initialize $\theta^- = \theta$
- keep $\theta^-$ fixed for a few training steps
- then sync $\theta^- \leftarrow \theta$ again

This way, for the steps where $\theta^-$ is frozen, the target $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$ stays fixed
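The freeze-and-sync loop can be sketched like this (the sync period and the stand-in "update" are illustrative, not from these notes):

```python
import numpy as np

# Minimal sketch of fixed Q-targets.
theta = np.zeros(5)          # online network weights, updated every step
theta_target = theta.copy()  # target network weights, frozen between syncs
SYNC_EVERY = 100             # sync period C (a hyperparameter)

for step in range(1, 301):
    theta += 0.01            # stand-in for a gradient update on theta
    # Targets would be computed with theta_target, which stays fixed...
    if step % SYNC_EVERY == 0:
        theta_target = theta.copy()  # ...until we sync it to theta again
```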
How do we sample batches of data for training?
TBD — I still have no clue what this problem is about
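From what I can tell, the standard DQN answer is an experience replay buffer: store transitions as they come, then train on uniform random minibatches, which breaks the correlation between consecutive samples. A minimal sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.push((t, 0, 1.0, t + 1, False))  # dummy transitions
batch = buf.sample(8)
```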