Introduction
Definition
An MDP is a mathematical blueprint that captures all the information the agent needs to make decisions. It can be expressed as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$
Explanation
$\mathcal{S}$: The Set of Possible States
States represent every situation that the agent might find itself in
$\mathcal{A}$: The Set of Possible Actions
Actions represent all the choices available to the agent
$\mathcal{R}$: The Reward Distribution
When the agent takes action $a$ in state $s$, the reward is sampled from $r \sim \mathcal{R}(\cdot \mid s, a)$
$\mathbb{P}$: The Transition Distribution
When we transition from $s$ to $s'$ by taking action $a$, the next state $s'$ is not fixed. Instead, it is sampled from $s' \sim \mathbb{P}(\cdot \mid s, a)$
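As a minimal sketch (the state names, action, and probabilities below are made up for illustration), a transition distribution over a small discrete state space can be stored as a table of probabilities and sampled with the standard library:

```python
import random

# Hypothetical 2-state MDP: P(s' | s, a) stored as a nested dict.
# The probabilities for each (state, action) pair must sum to 1.
P = {
    ("sunny", "go"): {"sunny": 0.8, "rainy": 0.2},
    ("rainy", "go"): {"sunny": 0.4, "rainy": 0.6},
}

def sample_next_state(s, a):
    """Sample s' ~ P(. | s, a)."""
    dist = P[(s, a)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs)[0]

s_next = sample_next_state("sunny", "go")
```

The same table lookup generalizes to any finite state and action space; continuous spaces would replace the dict with a density.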
$\gamma$: The Discount Factor
How much should future rewards matter compared to immediate ones in current decision making?
The discount factor $\gamma \in [0, 1]$ encodes this tradeoff: a reward received $t$ steps in the future is weighted by $\gamma^t$
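To make the tradeoff concrete, here is a small sketch (the reward trajectory is made up) computing the discounted return $\sum_t \gamma^t r_t$ for a high and a low discount factor:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a trajectory of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]  # hypothetical trajectory: big reward comes late

# gamma near 1 values the late reward of 10 almost fully;
# gamma near 0 cares almost only about the immediate reward.
high = discounted_return(rewards, gamma=0.99)  # ~12.67
low = discounted_return(rewards, gamma=0.1)    # ~1.12
```

With $\gamma = 0.99$ the agent would wait for the large delayed reward; with $\gamma = 0.1$ it effectively ignores it.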
Markov Property
The current state $s_t$ completely describes the current world. The agent makes decisions based only on $s_t$, with no need to take history into consideration
Process
Policy $\pi$: The Action Probability Distribution
The agent samples an action from the policy: $a \sim \pi(\cdot \mid s)$
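A stochastic policy over a discrete action set can be sketched the same way as the transition table: one distribution over actions per state (the state and action names here are illustrative):

```python
import random

# pi(a | s): for each state, a probability distribution over actions.
pi = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.9, "right": 0.1},
}

def sample_action(s):
    """Sample a ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]

a = sample_action("s0")
```

A deterministic policy is the special case where one action per state has probability 1.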
Steps
- At time step $t = 0$, environment samples initial state $s_0 \sim p(s_0)$
- Then, repeat until done:
  - Agent selects an action $a_t \sim \pi(\cdot \mid s_t)$
  - Environment samples reward $r_t \sim \mathcal{R}(\cdot \mid s_t, a_t)$
  - Environment samples next state $s_{t+1} \sim \mathbb{P}(\cdot \mid s_t, a_t)$
  - Agent receives reward $r_t$ and next state $s_{t+1}$
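The steps above can be sketched as a generic agent–environment interaction loop. The environment below is a toy stand-in with made-up dynamics; the `reset`/`step` interface is an assumption for this sketch, though it mirrors common RL environment APIs:

```python
import random

class ToyEnv:
    """Toy 2-state environment with made-up dynamics."""

    def reset(self):
        self.s = 0                      # t = 0: environment samples s_0
        return self.s

    def step(self, a):
        r = random.gauss(float(a == self.s), 0.1)  # r_t ~ R(.|s_t, a_t)
        self.s = random.choice([0, 1])             # s_{t+1} ~ P(.|s_t, a_t)
        done = random.random() < 0.1               # episode ends randomly
        return self.s, r, done

def run_episode(env, policy, max_steps=100):
    s = env.reset()
    total = 0.0
    for _ in range(max_steps):
        a = policy(s)                   # agent selects a_t ~ pi(.|s_t)
        s, r, done = env.step(a)        # env samples r_t and s_{t+1}
        total += r                      # agent receives reward and next state
        if done:
            break
    return total

# Illustrative deterministic policy: pick the action matching the state.
ret = run_episode(ToyEnv(), policy=lambda s: s)
```

The loop is the Process section in code: only the current state `s` is passed to the policy, which is exactly the Markov property at work.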