Simple Conclusion

Statement

Let $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ be the sequence of states and actions we get when following policy $\pi_\theta$, which is sampled from $\tau \sim P(\tau;\theta)$

Then the objective is

$$J(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[R(\tau)\right]$$

and we can calculate the gradient by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[R(\tau)\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

Intuition

When $R(\tau)$ is high, increase the probability of the actions we took, and vice versa

Steps

  • Initialize random weights $\theta$
  • Repeat:
    • Collect trajectories $\tau^{(1)}, \dots, \tau^{(N)}$ and rewards $R(\tau^{(i)})$ using policy $\pi_\theta$
    • Compute $\hat{g} = \frac{1}{N}\sum_{i=1}^{N} R(\tau^{(i)}) \sum_{t} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$
    • Gradient ascent on $\theta$: $\theta \leftarrow \theta + \alpha \hat{g}$
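The loop above can be sketched in plain Python on a toy problem. Everything here (the 2-armed bandit, its reward means, the step size and batch size) is made up for illustration; the policy is a softmax over per-arm logits $\theta$, for which $\nabla_\theta \log \pi_\theta(a)$ has the simple closed form $e_a - \pi_\theta$:

```python
import math
import random

random.seed(0)

# Illustrative toy environment: a 2-armed bandit (single state).
# A trajectory is a single action; R(tau) is the noisy arm reward.
TRUE_MEANS = [0.2, 0.8]          # arm 1 is the better arm (made-up numbers)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_arm(probs):
    r, acc = random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r < acc:
            return a
    return len(probs) - 1

theta = [0.0, 0.0]               # initialize weights
alpha, N = 0.1, 50               # step size, trajectories per batch

for step in range(200):
    grad = [0.0, 0.0]
    for _ in range(N):
        probs = softmax(theta)
        a = sample_arm(probs)                       # collect trajectory
        R = TRUE_MEANS[a] + random.gauss(0, 0.1)    # and its reward
        # grad_theta log pi(a) for a softmax policy is (e_a - probs)
        for k in range(2):
            grad[k] += R * ((1.0 if k == a else 0.0) - probs[k]) / N
    # gradient ascent on theta
    theta = [t + alpha * g for t, g in zip(theta, grad)]

print(softmax(theta))   # probability mass should concentrate on the better arm
```

Note there is no baseline subtraction here; the vanilla estimator still works on this toy problem but has high variance in general.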

Derivation

Objective

We want to find an expression for $\nabla_\theta J(\theta)$ that can be computed effectively, i.e. estimated from sampled trajectories

Derivation

Revise Original Expression

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[R(\tau)\right] = \nabla_\theta \int P(\tau;\theta)\, R(\tau)\, d\tau = \int \nabla_\theta P(\tau;\theta)\, R(\tau)\, d\tau = \int P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[\nabla_\theta \log P(\tau;\theta)\, R(\tau)\right]$$

The second to last "=" comes from the log-derivative trick:

$$\nabla_\theta \log P(\tau;\theta) = \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \quad\Longrightarrow\quad \nabla_\theta P(\tau;\theta) = P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)$$
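The log-derivative trick can be checked numerically on a toy distribution. Below, $X \sim \mathrm{Bernoulli}(\sigma(\theta))$ and $f$ plays the role of the reward; the particular values of $\theta$ and $f$ are made up for illustration. The exact gradient of $\mathbb{E}[f(X)]$ is compared with the score-function Monte Carlo estimate:

```python
import math
import random

random.seed(0)

# For X ~ Bernoulli(p) with p = sigmoid(theta), verify
#   grad_theta E[f(X)] = E[ f(X) * grad_theta log P(X; theta) ].
theta = 0.3
p = 1.0 / (1.0 + math.exp(-theta))      # sigmoid
f = {0: 1.0, 1: 3.0}                    # arbitrary "reward" per outcome

# Exact gradient: E[f] = (1-p) f(0) + p f(1), and dp/dtheta = p (1-p)
exact = (f[1] - f[0]) * p * (1 - p)

# Score-function estimate: for this parameterization,
# grad_theta log P(x; theta) = x - p
N = 200_000
est = 0.0
for _ in range(N):
    x = 1 if random.random() < p else 0
    est += f[x] * (x - p) / N

print(exact, est)   # the two values should agree closely
```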

Compute $\nabla_\theta \log P(\tau;\theta)$

The trajectory probability factorizes as

$$P(\tau;\theta) = \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$

so

$$\log P(\tau;\theta) = \sum_{t=0}^{T} \log P(s_{t+1} \mid s_t, a_t) + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)$$

The transition probability $P(s_{t+1} \mid s_t, a_t)$ is impossible to compute because the transition is decided by the environment, which we can’t backpropagate through

However, surprisingly, this term is eliminated when computing the gradient, because it does not depend on $\theta$:

$$\nabla_\theta \log P(\tau;\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
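The vanishing of the transition term can be demonstrated numerically. In the sketch below, a tiny hand-made MDP (all transition probabilities and the trajectory are illustrative inventions) lets us compute $\log P(\tau;\theta)$ exactly; a finite-difference gradient of the full log-probability matches the policy-only score $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$:

```python
import math

# Tiny illustrative MDP: 2 states, 2 actions.
# T[s][a] = probability of landing in state 1; it has NO theta in it,
# so it must drop out of grad_theta log P(tau; theta).
T = [[0.9, 0.2], [0.4, 0.7]]

def pi(theta, a, s):
    # Bernoulli policy: prob of action 1 in state s is sigmoid(theta[s])
    p1 = 1.0 / (1.0 + math.exp(-theta[s]))
    return p1 if a == 1 else 1.0 - p1

def log_traj_prob(theta, traj):
    # log P(tau; theta) = sum_t [ log pi(a_t|s_t) + log P(s_{t+1}|s_t,a_t) ]
    lp = 0.0
    for (s, a, s_next) in traj:
        lp += math.log(pi(theta, a, s))
        lp += math.log(T[s][a] if s_next == 1 else 1.0 - T[s][a])
    return lp

theta = [0.5, -0.3]
traj = [(0, 1, 1), (1, 0, 0), (0, 0, 1)]   # fixed (s, a, s') triples

# Finite-difference gradient of the FULL log-probability w.r.t. theta[0]
eps = 1e-6
full = (log_traj_prob([theta[0] + eps, theta[1]], traj)
        - log_traj_prob([theta[0] - eps, theta[1]], traj)) / (2 * eps)

# Policy-only score: sum_t grad log pi(a_t|s_t); for this parameterization
# grad_{theta[s]} log pi(a|s) = a - sigmoid(theta[s])
policy_only = sum(a - 1.0 / (1.0 + math.exp(-theta[s]))
                  for (s, a, _) in traj if s == 0)

print(full, policy_only)   # equal: the transition term carries no theta
```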

Conclusion

Putting the two derivations together we have:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$