Reinforcement Learning

Introduction

An agent performs actions in an environment and receives rewards based on the actions it takes. The agent updates its policy to collect more reward

Process

Repeat:

  1. The agent sees a state from the environment
  2. It takes an action based on what it sees
  3. The agent receives a reward and the next state based on the action it took
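The loop above can be sketched in code. This is a minimal sketch with a made-up toy environment and a placeholder random agent, not any specific library's API:

```python
import random

class GridEnvironment:
    """Toy 1-D world: the agent walks left/right over positions 0..4.
    Reaching position 4 gives reward 1 and ends the episode."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action: -1 (move left) or +1 (move right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

class RandomAgent:
    """Placeholder agent that picks actions uniformly at random."""
    def act(self, state):
        return random.choice([-1, +1])

env = GridEnvironment()
agent = RandomAgent()
state, total_reward, done = env.state, 0.0, False
while not done:
    action = agent.act(state)               # step 2: act based on what it sees
    state, reward, done = env.step(action)  # steps 1 & 3: see state, receive reward
    total_reward += reward
print(total_reward)
```

A learning agent would additionally use `(state, action, reward)` to update its policy; the random agent here only illustrates the interaction loop.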

Difference against Supervised Learning

1. Stochasticity

Even if we take the same action in the same state, the state transition and the reward may differ. The reward and state transition aren't fixed; they are sampled from probability distributions

For example: when a robot decides to turn right, a gust of wind may affect the next state

2. Credit Assignment

In chess, we gain a reward only after we win the game. However, it's very hard for the network to determine which of the moves it made during the game led to the win

3. Non-Differentiable

The reward and state transition are sampled from probability distributions because the world is ever-changing and not predictable. Hence, we can't backpropagate through the world, since we can't record the complete state of the world at every moment

4. Non-Stationary

Non-stationarity refers to situations where the environment changes over time, violating our initial assumption that its dynamics are fixed

Difference between stochasticity and non-stationarity

Stochasticity: the environment changes over time, but the change is captured in our probability distributions
Non-stationarity: the environment changes over time, but the change is surprising and not captured in our probability distributions


Markov Decision Process (MDP)

An MDP is a mathematical blueprint that describes everything the agent needs to know to make decisions

D-DL4CV-Lec21a-MDP
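As a text companion to the figure above (standard MDP notation, reconstructed from memory rather than copied from the slide), an MDP is usually written as a tuple:

```latex
\mathrm{MDP} = (\mathcal{S},\ \mathcal{A},\ R,\ P,\ \gamma)
```

where S is the set of states, A the set of actions, R(s, a) the reward distribution, P(s' | s, a) the state-transition distribution, and γ ∈ [0, 1] the discount factor.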


Finding Optimal Policy with Q-Function

Goal

We want to find the optimal policy that maximizes the discounted sum of rewards. However, there is a lot of randomness in the process (lots of sampling), so we instead maximize the expected sum of discounted rewards

Value Function and Q-Function

The value function and the Q-function help us compute the expected cumulative reward based on the current situation

D-DL4CV-Lec21b-Value_Function D-DL4CV-Lec21c-Q-Function
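In standard notation (reconstructed here, not copied from the slides), the two functions are:

```latex
V^{\pi}(s)    = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \,\Big|\, s_0 = s,\ \pi\Big]
Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \,\Big|\, s_0 = s,\ a_0 = a,\ \pi\Big]
```

The value function scores a state; the Q-function scores a state-action pair.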

Bellman Equation

Optimal Q-Function

The optimal Q-function Q* is the Q-function for the optimal policy π*. It gives the maximum possible expected future reward when taking action a in state s

If we find Q*, we can then get π*, since Q* contains all the information we need to find it: π*(s) = argmax_a Q*(s, a)

Bellman Equation

Q* satisfies the following recurrence relation:

Q*(s, a) = E_{r, s'} [ r + γ max_{a'} Q*(s', a') ]

Intuition: the best value of taking action a in state s is the immediate reward plus the discounted best value achievable from wherever we land next.

Finding the Optimal Q-Function

Value Iteration Convergence

D-DL4CV-Lec21d-Value_Iteration_Convergence
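Value iteration simply applies the Bellman recurrence as an update until the Q-values stop changing. A minimal tabular sketch on a hypothetical two-state, two-action MDP (the MDP's numbers are made up for illustration):

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(500):
    # Bellman update: Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
    Q = R + gamma * P @ Q.max(axis=1)
print(Q)
```

Because the Bellman update is a contraction, repeated application converges to Q* regardless of the initial Q.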

Problem: for every state s, we need to memorize the optimal Q-value for every action a. Hence, when there are infinitely many states, we'd need infinite memory, which is impossible

Solution: we approximate Q* with a neural network and use the Bellman equation to construct the loss

Deep Q-Learning

D-DL4CV-Lec21e-Deep_Q-Learning
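The core of deep Q-learning is regressing Q(s, a) onto the Bellman target. A minimal numpy sketch of the loss computation on a fake batch of transitions; the "network" here is just a random linear map, an illustrative stand-in for a real deep network:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_network(states, weights):
    """Stand-in 'network': a linear map from state features to per-action Q-values."""
    return states @ weights

# Fake batch of 8 transitions (s, a, r, s', done) with 4-dim states and 2 actions.
states      = rng.normal(size=(8, 4))
actions     = rng.integers(0, 2, size=8)
rewards     = rng.normal(size=8)
next_states = rng.normal(size=(8, 4))
dones       = np.zeros(8)          # 1.0 would mean the episode ended
gamma       = 0.99
weights     = rng.normal(size=(4, 2))

# Bellman target: y = r + gamma * max_a' Q(s', a'), with no future value if done
targets = rewards + gamma * (1 - dones) * q_network(next_states, weights).max(axis=1)
# Predicted Q-value of the action actually taken in each transition
preds = q_network(states, weights)[np.arange(8), actions]
# The squared Bellman error we would minimize by gradient descent on the weights
loss = np.mean((preds - targets) ** 2)
print(loss)
```

In a real implementation the targets are usually computed with a frozen "target network" and the transitions drawn from a replay buffer; both are omitted here to keep the sketch short.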


Finding Optimal Policy with Policy Gradient

Policy Gradient

Approach

Train a network that takes the state s as input and outputs a distribution π_θ(a | s) over the actions to take in that state

Objective Function

Expected future rewards when following policy π_θ:

J(θ) = E_{τ ~ p_θ} [ Σ_t γ^t r_t ]

We can find the optimal policy by gradient ascent on J:

θ ← θ + α ∇_θ J(θ)

Problem: Gradient Calculation

When we change the weights, we change the distribution over trajectories and rewards, which makes computing the gradient directly intractable

Hence, we need a way to work around this computation
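The standard workaround (the log-derivative trick, stated here from memory rather than from the slide) rewrites the gradient as an expectation we can estimate by sampling trajectories:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim p_\theta}\big[ r(\tau)\, \nabla_\theta \log p_\theta(\tau) \big]
  = \mathbb{E}_{\tau \sim p_\theta}\Big[ r(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
```

The transition probabilities don't depend on θ, so their terms drop out of the log-gradient, and we never need to differentiate through the environment.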

REINFORCE algorithm

D-DL4CV-Lec21f-REINFORCE_Algorithm
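A minimal REINFORCE sketch on a hypothetical 2-armed bandit (one state, so a "trajectory" is a single action; the arm means and hyperparameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])  # hypothetical expected reward of each arm
theta = np.zeros(2)                # policy logits, one per action
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)               # sample an action from the policy
    r = rng.normal(true_means[a], 0.1)       # sample a reward from the environment
    # Score-function gradient for a softmax policy: grad log pi(a) = one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi            # gradient ascent on expected reward

print(softmax(theta))  # the better arm should dominate
```

Note the update never differentiates through the environment: the reward enters only as a scalar weight on the policy's own log-probability gradient, which is exactly the workaround the log-derivative trick provides.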