Simple Conclusion
Statement
Let $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ be the sequence of states and actions we get when following policy $\pi_\theta$, so that $\tau$ is sampled from $p_\theta(\tau)$.
Then $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$,
and we can calculate the gradient by
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
Intuition
When $R(\tau)$ is high, increase the probability of the actions we took, and vice versa
Steps
- Initialize random weights $\theta$
- Repeat:
    - Collect trajectories $\tau_i$ and rewards $R(\tau_i)$ using policy $\pi_\theta$
    - Compute $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \left(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right) R(\tau_i)$
    - Gradient ascent on $\theta$: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
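The loop above can be sketched in a few lines. This is a minimal illustration, not a full implementation: the environment (a two-armed bandit with made-up payoffs), the softmax policy, and all hyperparameters are assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-armed bandit: arm 1 pays more on average.
ARM_MEANS = [0.0, 1.0]

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.zeros(2)              # initial weights
lr, n_traj = 0.1, 50

for _ in range(200):             # repeat
    grad = np.zeros_like(theta)
    for _ in range(n_traj):      # collect trajectories and rewards using pi_theta
        probs = softmax(theta)
        a = rng.choice(2, p=probs)
        r = ARM_MEANS[a] + rng.normal(0.0, 0.1)
        # grad log pi(a) for a softmax policy is one_hot(a) - probs
        glogp = -probs
        glogp[a] += 1.0
        grad += glogp * r        # accumulate the gradient estimate
    theta += lr * grad / n_traj  # gradient ascent on theta

print(softmax(theta))            # policy should now prefer the better arm
```

Because a trajectory here is a single action, the inner sum over $t$ collapses to one score term; the estimator is otherwise the same average over $N$ sampled trajectories.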
Derivation
Objective
We want to find an expression for $\nabla_\theta J(\theta)$ that can be computed effectively
Derivation
Revise Original Expression
$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)] = \nabla_\theta \int p_\theta(\tau) R(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right]$$
The second-to-last "=" comes from the log-derivative trick: $\nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}$, so $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$
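The log-derivative trick is easy to check numerically. The sketch below uses an assumed one-parameter Bernoulli distribution with $p_\theta(x{=}1) = \sigma(\theta)$ and compares a finite-difference derivative of $p_\theta$ against $p_\theta \nabla_\theta \log p_\theta$:

```python
import math

# Check the log-derivative trick: d/dtheta p = p * d/dtheta log p.
# Toy distribution (an assumption for illustration): p_theta(x=1) = sigmoid(theta).
theta, eps = 0.3, 1e-6
sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))

p = sigmoid(theta)
finite_diff = (sigmoid(theta + eps) - sigmoid(theta - eps)) / (2 * eps)
grad_log_p = 1.0 - p  # analytic d/dtheta log sigmoid(theta)
assert abs(finite_diff - p * grad_log_p) < 1e-6
```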
Compute $\nabla_\theta \log p_\theta(\tau)$
The trajectory probability factorizes as $p_\theta(\tau) = p(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$, so $\log p_\theta(\tau) = \log p(s_0) + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log p(s_{t+1} \mid s_t, a_t)$
The transition term $p(s_{t+1} \mid s_t, a_t)$ is impossible to compute because the transition is decided by the environment, which we can't backpropagate through
However, surprisingly, this term is eliminated when we take the gradient: it does not depend on $\theta$, so $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Conclusion
Putting the two derivations together we have:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
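The conclusion can be verified end to end on a toy MDP. The sketch below enumerates every trajectory of a made-up 2-state, 2-action, horizon-2 MDP (all numbers are assumptions for illustration), computes the policy-gradient expectation exactly, and checks it against a finite-difference gradient of $J(\theta)$. Note the transition probabilities `P` appear only in the trajectory weights, never inside a gradient term.

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # reward R[s, a]; start state is s0 = 0

def policy(theta):                         # softmax policy, one row of logits per state
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def J(theta):
    """Exact expected return, enumerating all horizon-2 trajectories."""
    pi, total = policy(theta), 0.0
    for a0 in range(2):
        for s1 in range(2):
            for a1 in range(2):
                prob = pi[0, a0] * P[0, a0, s1] * pi[s1, a1]
                total += prob * (R[0, a0] + R[s1, a1])
    return total

def grad_J(theta):
    """Exact E[(sum_t grad log pi(a_t|s_t)) * R(tau)] over all trajectories."""
    pi, g = policy(theta), np.zeros_like(theta)
    for a0 in range(2):
        for s1 in range(2):
            for a1 in range(2):
                prob = pi[0, a0] * P[0, a0, s1] * pi[s1, a1]
                glogp = np.zeros_like(theta)
                glogp[0] -= pi[0]; glogp[0, a0] += 1.0    # score at t = 0
                glogp[s1] -= pi[s1]; glogp[s1, a1] += 1.0  # score at t = 1
                g += prob * glogp * (R[0, a0] + R[s1, a1])
    return g

theta, eps = np.array([[0.2, -0.1], [0.4, 0.3]]), 1e-6
fd = np.zeros_like(theta)
for i in range(2):
    for j in range(2):
        d = np.zeros_like(theta); d[i, j] = eps
        fd[i, j] = (J(theta + d) - J(theta - d)) / (2 * eps)
assert np.allclose(fd, grad_J(theta), atol=1e-5)
```

In practice the expectation cannot be enumerated, so it is replaced by the Monte Carlo average over sampled trajectories from the Steps section.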