Reinforcement Learning
Introduction
An agent performs actions in an environment and receives rewards based on the actions it takes. The agent updates its policy to obtain more reward
Process
Repeat:
- The agent sees a state from the environment
- It takes an action based on what it sees
- The agent receives a reward based on the action and the resulting state transition
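The loop above can be sketched in a few lines of Python. The environment below is a made-up toy (a 1-D walk where reaching position 4 gives reward 1 and ends the episode); the `GridEnv`/`step` interface is an assumption for illustration, loosely modeled on common RL environment APIs.

```python
import random

class GridEnv:
    """Toy environment: the agent walks left/right on positions 0..4;
    reaching position 4 yields reward 1 and ends the episode."""
    def __init__(self):
        self.state = 0

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def random_policy(state):
    # A placeholder policy; a real agent would update this to get more reward.
    return random.choice([-1, +1])

env = GridEnv()
state, total_reward, done = env.state, 0.0, False
while not done:
    action = random_policy(state)           # act based on the observed state
    state, reward, done = env.step(action)  # environment returns next state + reward
    total_reward += reward
print(total_reward)  # 1.0 once the goal is reached
```

A random walk on a finite chain reaches the goal with probability 1, so the loop terminates.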
Difference against Supervised Learning
1. Stochasticity
Even if we take the same action in the same state, the state transition and the reward may differ. The reward and state transition aren’t fixed; they are sampled from probability distributions
For example: when a robot decides to turn right, a gust of wind may affect the next state
2. Credit Assignment
In chess, we’ll gain reward after we win the game. However, it’s very hard for the network to determine what step it took during the game leads to the win
3. Non-Differentiable
The reward and state transition are sampled from probability distributions because the world is ever-changing and unpredictable. Hence, we aren’t able to backpropagate through the world, since we can’t capture the complete state of the world at every moment
4. Non-Stationary
Non-stationarity refers to situations where the environment changes over time, violating our initial assumptions
Difference between stochasticity and non-stationary
Stochasticity: the environment changes over time, but the change is captured by our probability distributions
Non-stationarity: the environment changes over time, but the change is surprising and not captured by our probability distributions
Markov Decision Process (MDP)
An MDP is a mathematical blueprint that describes everything the agent needs to know to make decisions: a set of states $S$, a set of actions $A$, a reward distribution $R(r \mid s, a)$, a transition distribution $P(s' \mid s, a)$, and a discount factor $\gamma$
Finding Optimal Policy with Q-Function
Goal
We want to find the optimal policy $\pi^*$ that maximizes the discounted sum of rewards $\sum_{t \ge 0} \gamma^t r_t$. However, there is a lot of randomness in the process (lots of sampling), so we instead maximize the expected discounted sum of rewards
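The discounted sum is a one-liner; a minimal sketch, with a made-up reward sequence for one episode:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over one sampled trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Rewards collected over one episode (hypothetical values):
# nothing for two steps, then a reward of 1 at t = 2.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9^2 = 0.81
```

The expectation the policy maximizes is the average of this quantity over many sampled trajectories.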
Value Function and Q-Function
The value function and the Q-function help us compute the expected cumulative reward from the current situation:
Value function: $V^\pi(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
Q-function: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
D-DL4CV-Lec21b-Value_Function D-DL4CV-Lec21c-Q-Function
Bellman Equation
Optimal Q-Function
The optimal Q-function $Q^*$ is the Q-function for the optimal policy $\pi^*$. It gives the maximum possible future reward when taking action $a$ at state $s$:
$Q^*(s, a) = \max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
If we find $Q^*$, we can then get $\pi^*$, since $Q^*$ contains all the information we need to find it:
$\pi^*(s) = \arg\max_a Q^*(s, a)$
Bellman Equation
$Q^*$ will satisfy the following recurrence relation:
$Q^*(s, a) = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q^*(s', a')\right]$
Finding Optimal $Q^*$
Start from a random $Q$ and repeatedly apply the Bellman Equation as an update rule:
$Q_{i+1}(s, a) = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q_i(s', a')\right]$
$Q_i$ converges to $Q^*$ as $i \to \infty$
Value Iteration Convergence
D-DL4CV-Lec21d-Value_Iteration_Convergence
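Tabular value iteration can be sketched in a few lines. The MDP below is made up for illustration: a deterministic 3-state chain where moving right from state 1 into the terminal state 2 yields reward 1.

```python
# Q-value iteration on a toy 3-state chain MDP (hypothetical setup):
# states 0, 1, 2; actions 0 (stay) and 1 (move right); state 2 is terminal.
gamma = 0.9
states, actions = [0, 1], [0, 1]

def transition(s, a):
    """Deterministic next state and reward for this toy MDP."""
    if a == 1:
        s_next = s + 1
        return s_next, (1.0 if s_next == 2 else 0.0)
    return s, 0.0

Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(100):  # repeatedly apply the Bellman update until convergence
    for s in states:
        for a in actions:
            s_next, r = transition(s, a)
            future = 0.0 if s_next == 2 else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] = r + gamma * future

print(Q[(1, 1)])  # 1.0  (one step to the goal)
print(Q[(0, 1)])  # 0.9  (the same reward, discounted by one extra step)
```

Reading off $\arg\max_a Q(s, a)$ per state then recovers the optimal policy (always move right).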
Problem: for every state $s$ and action $a$, we need to memorize the value $Q(s, a)$. Hence, when there are infinitely many states, we’d need infinite memory, which is impossible
Solution: We’ll try to approximate with a neural network, and use Bellman Equation as loss
Deep Q-Learning
D-DL4CV-Lec21e-Deep_Q-Learning
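A hedged NumPy sketch of the Bellman loss on a batch of transitions. The "network" here is a stand-in linear map and all data are randomly generated, purely to show how the target and loss are formed; a real implementation would use a deep network and sampled experience.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a Q-network: Q(s, .) = s @ W, one column per action.
# In practice this would be a deep network with parameters theta.
n_state_dims, n_actions, gamma = 4, 2, 0.9
W = rng.normal(size=(n_state_dims, n_actions))

def q_values(states):
    """Q-values for a batch of states, shape (batch, n_actions)."""
    return states @ W

# A sampled batch of transitions (s, a, r, s') -- values are made up.
batch = 8
s  = rng.normal(size=(batch, n_state_dims))
a  = rng.integers(0, n_actions, size=batch)
r  = rng.normal(size=batch)
s2 = rng.normal(size=(batch, n_state_dims))

# Bellman target: y = r + gamma * max_a' Q(s', a'), treated as a constant.
y = r + gamma * q_values(s2).max(axis=1)

# Loss: mean squared error between Q(s, a) and the target y;
# gradient descent on this loss pushes Q_theta toward the Bellman fixed point.
q_sa = q_values(s)[np.arange(batch), a]
loss = np.mean((q_sa - y) ** 2)
print(loss >= 0.0)  # True
```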
Finding Optimal Policy with Policy Gradient
Policy Gradient
Approach
Train a network $\pi_\theta$ that takes the state as input and gives a distribution over the actions to take in that state
Objective Function
Expected future rewards when following policy $\pi_\theta$:
$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t \ge 0} \gamma^t r_t\right]$
We can find the optimal policy $\theta^* = \arg\max_\theta J(\theta)$ by using gradient ascent:
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
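The ascent update itself is simple once a gradient is available. A toy sketch on a made-up differentiable objective $J(\theta) = -(\theta - 3)^2$, where the gradient is known in closed form; in RL, obtaining $\nabla_\theta J$ is exactly the hard part:

```python
def J(theta):
    """Toy objective with its maximum at theta = 3 (hypothetical)."""
    return -(theta - 3.0) ** 2

def grad_J(theta):
    """Analytic gradient -- available here, but not for the RL objective."""
    return -2.0 * (theta - 3.0)

theta, lr = 0.0, 0.1
for _ in range(200):
    theta = theta + lr * grad_J(theta)   # ascent step: theta += alpha * grad
print(round(theta, 4))  # 3.0
```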
Problem: Gradient Calculation
When we change the weights $\theta$, we also change the distribution over trajectories and rewards, which makes computing $\nabla_\theta J(\theta)$ directly super complex
Hence, we need to come up with a way to overcome this computation