Introduction

Definition

An MDP (Markov Decision Process) is a mathematical blueprint that captures all the information the agent needs to make decisions. It can be expressed as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$

Explanation

$\mathcal{S}$: The Set of Possible States

States represent every situation that the agent might find itself in

$\mathcal{A}$: The Set of Possible Actions

Actions represent all the possible choices available to the agent

$\mathcal{R}$: The Reward Distribution

When the agent takes action $a$ in state $s$, the reward is sampled from the reward distribution: $r \sim \mathcal{R}(\cdot \mid s, a)$

$\mathbb{P}$: The Transition Distribution

When the agent transitions from state $s$ by taking action $a$, the next state $s'$ is not fixed. Instead, it is sampled from the transition distribution: $s' \sim \mathbb{P}(\cdot \mid s, a)$

$\gamma$: The Discount Factor

How much should the future rewards matter compared to immediate ones in current decision making?

The discount factor $\gamma \in [0, 1]$ encodes this tradeoff: a reward received $t$ steps in the future is weighted by $\gamma^t$
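As a concrete illustration of this tradeoff (the reward sequence and $\gamma$ values below are made up for the example), the discounted sum $\sum_t \gamma^t r_t$ can be computed as:

```python
# Toy illustration of the discount factor; rewards and gammas are invented.
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]  # the same reward at every step

print(discounted_return(rewards, 1.0))  # future counts fully: 4.0
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
print(discounted_return(rewards, 0.0))  # only the immediate reward: 1.0
```

With $\gamma$ close to 1 the agent is far-sighted; with $\gamma$ close to 0 it cares almost only about the immediate reward.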

Markov Property

The current state completely describes the current world. The agent makes decisions based only on $s_t$ and has no need to take history into consideration

Process

Policy $\pi$: The Action Probability Distribution

The agent samples an action from its policy: $a \sim \pi(\cdot \mid s)$
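A minimal sketch of sampling from a stochastic tabular policy (the state names, action names, and probabilities here are invented for illustration):

```python
import random

# Hypothetical tabular policy: pi(a | s) stored as a dict of dicts.
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.9, "right": 0.1},
}

def sample_action(policy, state):
    """Draw an action a ~ pi(. | state)."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action(policy, "s0")
print(action)  # "left" or "right"; "right" is drawn about 80% of the time
```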

Steps

  • At time step $t = 0$, the environment samples an initial state $s_0 \sim p(s_0)$
  • Then, repeat till done:
    • Agent selects an action $a_t \sim \pi(\cdot \mid s_t)$
    • Environment samples reward $r_t \sim \mathcal{R}(\cdot \mid s_t, a_t)$
    • Environment samples next state $s_{t+1} \sim \mathbb{P}(\cdot \mid s_t, a_t)$
    • Agent receives reward $r_t$ and next state $s_{t+1}$
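The steps above can be sketched on a hypothetical two-state toy MDP (every name, reward, and probability below is invented for illustration, not taken from the text):

```python
import random

# Hypothetical toy MDP with two states and two actions.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# Transition distribution P(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 0.9, "s0": 0.1},
    ("s1", "move"): {"s1": 0.3, "s0": 0.7},
}

def sample_reward(s, a):
    """Reward distribution R(. | s, a): a state-dependent mean plus noise."""
    mean = 1.0 if s == "s1" else 0.0
    return mean + random.gauss(0, 0.1)

def sample_next_state(s, a):
    """Next state s' ~ P(. | s, a)."""
    dist = P[(s, a)]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def sample_policy(s):
    """Uniform random policy: a ~ pi(. | s)."""
    return random.choice(ACTIONS)

# The agent-environment loop from the steps above.
s = random.choice(STATES)          # environment samples initial state s_0
for t in range(5):                 # "repeat till done" (fixed horizon here)
    a = sample_policy(s)           # agent selects a_t ~ pi(. | s_t)
    r = sample_reward(s, a)        # environment samples r_t ~ R(. | s_t, a_t)
    s = sample_next_state(s, a)    # environment samples s_{t+1} ~ P(. | s_t, a_t)
    print(t, a, round(r, 2), s)    # agent receives reward and next state
```

A uniform random policy is used only to keep the sketch short; a learned policy would replace `sample_policy`.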